Compositions of self-reporting transposon (SRT) constructs and methods for mapping transposon insertions

ABSTRACT

Among the various aspects of the present disclosure is the provision of compositions and methods for mapping transposon insertions. Applications can include mapping the locations of self-reporting transposons (SRTs) from thousands of single cells in parallel, while simultaneously measuring mRNA abundance from the same single cells; analyzing genome-associated protein (GAP) (e.g., transcription factor) binding/interactions in a small number of cells in bulk, without single cell resolution; lineage tracing; or as an improved readout for transposon mutagenesis screens.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from U.S. Provisional Application Ser. No. 62/777,995 filed on 11 Dec. 2018, which is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under NS076993, MH109133, HG009750, HG009986, and MH017070 awarded by the National Institutes of Health. The government has certain rights in the invention.

MATERIAL INCORPORATED-BY-REFERENCE

The Sequence Listing, which is a part of the present disclosure, includes a computer readable form comprising nucleotide and/or amino acid sequences of the present invention. The subject matter of the Sequence Listing is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The present disclosure generally relates to self-reporting transposon (SRT) constructs and mapping of transposon insertions.

SUMMARY OF THE INVENTION

Among the various aspects of the present disclosure is the provision of compositions and methods for mapping transposon insertions.

An aspect of the present disclosure provides for a self-reporting transposon (SRT) construct comprised of a transposon comprising at least one promoter element, wherein the promoter is capable of driving transcription of RNA through at least one transposon end after the SRT construct is inserted into genomic DNA, so that a portion of the transposon DNA, at least one transposon end, and the genomic DNA flanking the transposon end are transcribed into RNA.

Another aspect of the present disclosure provides for a method for the insertion of an SRT construct into a cellular genome, wherein the SRT construct and either (i) a transposase capable of cutting or copying the transposon out of the transposon construct and pasting it into genomic DNA, or (ii) a genome-associated protein (e.g., a transcription factor, a chromatin reader, writer, or eraser), hereafter referred to as GAP, operably linked to a transposase capable of cutting the transposon out of the transposon construct and pasting it into genomic DNA, are delivered to cells so that the transposase gene or genome-associated protein-transposase fusion is expressed, or can be induced, after delivery to the cells. The SRT construct and transposase gene can be delivered simultaneously or sequentially.

Yet another aspect of the present disclosure provides for a plasmid encoding an SRT construct as described herein.

In some embodiments, the promoter is capable of transcribing the 3′ region flanking a transposon end comprising a transposon terminal repeat, resulting in an RNA transcript, wherein the RNA transcript is terminated by a cryptic poly-adenylation (poly-A) signal or picks up a poly-A stretch in the genome such that the transcript can be recovered by reverse transcription using a poly-T primer.

In some embodiments, the promoter is an inducible promoter.

In some embodiments, the inducible promoter is capable of being induced by a chemical inducer, light, or excision of a stop codon or poly-adenylation signal.

In some embodiments, the promoter is selected from the group consisting of, but not restricted to, an EF1α promoter, a CAG promoter, a PGK promoter, a Tet-on or Tet-off promoter, a T7 promoter, or a CMV promoter.

In some embodiments, the promoter drives expression of a reporter gene incorporated in the transposon.

In some embodiments, the reporter gene is selected from the group consisting of: a gene encoding a fluorescent protein; a gene capable of use as a selectable marker by conferring resistance to a chemical agent that kills eukaryotic or prokaryotic cells; or an enzyme capable of converting a chemical substrate into a colorimetric, luminescent, or fluorescent reporter; and combinations thereof.

In some embodiments, the gene encoding a fluorescent protein is selected from, but not restricted to, the group consisting of green fluorescent protein, tdTomato, eGFP, and eCFP.

In some embodiments, the gene capable of use as a selectable marker by conferring resistance to a chemical agent that kills eukaryotic or prokaryotic cells is selected from the group consisting of, but not restricted to, puromycin N-acetyl-transferase, providing resistance to puromycin; either of two aminoglycoside 3′ phosphotransferase genes encoded by Tn5 and Tn601 (i.e., a neo gene), providing resistance to G418; or hygromycin phosphotransferase, providing resistance to hygromycin.

In some embodiments, the enzyme capable of converting a chemical substrate into a colorimetric, luminescent, or fluorescent reporter is selected from the group consisting of, but not restricted to, beta-galactosidase or beta-lactamase, which cleave X-gal and the GeneBLAzer substrate, respectively.

In some embodiments, the RNA transcript produced by the promoter does not encode a splice donor site between the promoter and the transposon end.

In some embodiments, the transposon does not comprise a poly-adenylation (poly-A) termination signal between the promoter and the transposon end.

In some embodiments, the DNA comprising the transposon encodes a self-cleaving ribozyme such as, but not restricted to, the hammerhead ribozyme immediately downstream of the transposon end.

In some embodiments, the construct encodes a bacterial or eukaryotic ribosomal RNA sequence (e.g., 5S, 18S) downstream of the transposon end.

In some embodiments, the transposon encodes a Woodchuck Hepatitis Virus (WHV) Posttranscriptional Regulatory Element (WPRE).

In some embodiments, the genome-associated protein is a transcription factor, a general transcriptional mediator, or a chromatin reader, writer, or eraser.

In some embodiments, the genome-associated protein and the transposase are separated by a peptide linker.

In some embodiments, the DNA-binding protein is selected from the group consisting of, but not restricted to, Brd4, Sp1, Hb9, Olig2, Ngn2, Med1, Creb, p53, Usf1, or FoxA2.

In some embodiments, SRT insertions in the genome are mapped by a method comprising: (i) introducing a self-reporting transposon (SRT) into one or more cells; (ii) introducing DNA encoding a transposase into the same cells; (iii) allowing the transposase to direct the insertion of SRTs into cellular genomes; and (iv) mapping the locations of SRT insertions using cellular mRNA. In some embodiments, the genomic locations of protein-genome interactions, which can include direct binders of DNA (e.g., transcription factors) or DNA-associated proteins (e.g., chromatin readers and remodelers), can be identified by: (a) introducing DNA encoding a fusion of a GAP and transposase in step (ii); and (b) determining the locations of transient or stable GAP-DNA interactions from the aggregated SRT location data, after step (iv).

In some embodiments, the SRT construct and the gene encoding the transposase (or the GAP-transposase fusion) are delivered to the cell with known methods of gene delivery including, but not restricted to: (i) electroporation; (ii) lipofection; (iii) viral delivery (e.g., lentivirus, adenovirus, herpes simplex virus, adeno-associated virus, rabies virus); (iv) micro-injection; (v) sono-poration; or (vi) magnetofection.

In some embodiments, the SRT construct is encoded on plasmid DNA.

In some embodiments, the transposase gene or GAP-transposase fusion protein is encoded on plasmid DNA.

In some embodiments, the transposase or GAP-transposase fusion is delivered to the cell via mRNA or directly as a protein, using known methods for mRNA or protein delivery including, but not restricted to: (i) electroporation; (ii) lipofection; (iii) viral delivery (e.g., integration deficient lentivirus); (iv) micro-injection; (v) sono-poration; or (vi) magnetofection.

In some embodiments, the transposase is delivered to the cell by engineering its genome using known methods (e.g., homologous recombination, Cas9 mediated homologous recombination). The genome can be engineered so that the unfused transposase protein is expressed or so that one or more copies of an endogenous gene encoding a GAP is fused to the transposase, optionally with a sequence encoding a peptide linker.

In some embodiments, one or more copies of the SRT are delivered to the cell by engineering its genome using known methods (e.g., pro-nuclear injection, homologous recombination, Cas9 mediated homologous recombination).

In some embodiments, (i) the transposon is selected from the DDE family of transposons (e.g., piggyBac transposon, Sleeping Beauty transposon, Ty5 transposon), the rolling circle/Y2 family of transposons (e.g., helitrons), or the TP-retrotransposon family (e.g., LINE-1 retrotransposon); (ii) the transposase used to insert the transposon into the cellular genome is selected from the DDE family of transposases (e.g., piggyBac, Sleeping Beauty, Ty5), the rolling circle/Y2 family of transposases (e.g., Helraiser), or the TP-retrotransposon family (e.g., LINE-1), or any hyperactive variants of these transposases; and (iii) the transposase employed corresponds to the chosen transposon.

In some embodiments, the transposase or GAP-transposase fusion is fused to a destabilized domain (DD) so that transposition of SRTs can be induced by a small molecule such as, but not restricted to: Shield-1, FK506, rapamycin, trimethoprim, or tamoxifen.

In some embodiments, the DD is selected from the group consisting of, but not restricted to, a mutant of FKBP12, E. coli dihydrofolate reductase, or an estrogen receptor protein.

Yet another aspect of the present disclosure provides for a method for mapping the insertion locations of SRTs that have been transposed into the DNA of a plurality of cells (e.g., 10²-10⁸ cells) by: (i) harvesting total RNA in bulk from the cells; (ii) reverse transcribing mRNA into cDNA using a poly-T primer tailed with a universal sequence (e.g., the SMART primer); (iii) PCR amplifying the cDNA using a primer specific either for the transposon end or the reporter gene and a primer (optionally biotinylated) specific to the universal sequence; (iv) tagmenting the PCR product (e.g., using a Nextera kit); (v) amplifying using the transposon end primer and tagmentation primers suitable for amplifying the PCR product encoding the junction between the inserted transposon and the genome (e.g., a Nextera primer); and (vi) sequencing the tagged DNA fragments by employing second- or third-generation sequencing technology (e.g., Illumina or PacBio) using sequencing primers that are designed so that the transposon-genome junction is sequenced.

In some embodiments, unwanted transposon sequence from the PCR product (e.g., as can be the case when mapping LINE-1 retrotransposons) is removed after step (iii) by: (a) capturing the biotinylated PCR product on streptavidin-coated magnetic beads; (b) optionally, tailing the ends of the PCR product with a dideoxynucleotide (ddNTP); (c) incubating the PCR products in vitro with Cas9 and guide RNAs (gRNAs) to specifically cut the unwanted transposon sequence; (d) end repairing, A-tailing, and ligating Y-adapters to the cut PCR products “on bead”; (e) amplifying by PCR with a primer specific for the Y-adapter and a primer specific to the universal sequence; and (f) purifying the resulting PCR product and proceeding with step (iv). In some embodiments, (i) the cDNA is sequenced directly after step (iii) using second- or third-generation sequencing technology (e.g., Illumina NextSeq, Pacific Biosciences, Oxford Nanopore) with sequencing primers that are designed so that the transposon-genome junction is sequenced, and/or (ii) the reverse transcription of claim 32 or 33, step (ii) is performed using short random primers of 6 to 10 nucleotides tailed with a universal sequence (e.g., the SMART primer), and/or (iii) the PCR product produced in claim 32 or 33, step (vi) is sequenced using known methods for the high throughput sequencing of PCR products (e.g., shearing by sonication and the ligation of Illumina adapters).

Yet another aspect of the present disclosure provides for a method for mapping the genomic locations of SRTs that have been inserted into DNA of a plurality of cells (e.g., 10²-10⁸ cells) using a bacteriophage (e.g., T7, T3, SP6) promoter, comprising: (i) introducing into cells an SRT with a bacteriophage promoter in the transposon (as opposed to a eukaryotic promoter, such as EF1α); (ii) harvesting cellular DNA; (iii) shearing the DNA and ligating onto fragments a Y-linker comprising a universal primer, or tagmenting the DNA (e.g., using a Nextera kit); (iv) performing an in vitro transcription reaction; (v) performing first strand synthesis using a universal priming sequence; (vi) amplifying by PCR, using a universal primer (optionally biotinylated) and a primer targeting the transposon end; and (vii) optionally, removing unwanted transposon sequence from the PCR products by following the method described herein.

In some embodiments, the primers used in the final step are tailed with Illumina P5 and P7 (and optionally Illumina Seq1 and Seq2) sequences and the amplification product is loaded on an Illumina sequencer and sequenced.

Yet another aspect of the present disclosure provides for a method of mapping the locations of SRTs and simultaneously measuring mRNA abundance from a plurality of single cells in parallel, so that thousands or tens of thousands of cells can be analyzed in one experiment, composed of the following steps: (i) converting the mRNA from single cells into cDNA that is labeled at its 3′ end with a cell barcode, a unique molecular index (UMI), and a universal priming sequence using known methods for single-cell RNA-seq such as 10× Chromium, Drop-Seq, or InDrop, wherein all of the mRNA molecules from a single cell will be tagged with the same cell barcode, but different UMIs; (ii) separating the pooled cDNA from all cells analyzed in the experiment into two fractions; (iii) recovering the transcriptomes of the single cells by completing the single cell RNA-seq protocol chosen in step (i) using one of the cDNA fractions; and (iv) mapping the genomic locations of SRTs and assigning the SRTs to cell barcodes by circularizing SRT cDNA to physically bring the cell barcode in apposition to the insertion site and then performing Illumina sequencing. Step (iv) is achieved by: (a) amplifying SRT cDNA by performing PCR with primers that bind to the universal priming sequence next to the cell barcode and the transposon end, wherein these primers are biotinylated and carry a 5′ phosphate group; (b) optionally, removing unwanted transposon sequence from the PCR products described above; (c) diluting the PCR products of this amplification and performing a self-ligation reaction; (d) shearing the self-ligated products and capturing the ligation junction and flanking sequences by pulling these fragments down with streptavidin-coated magnetic beads; (e) preparing this fragment for Illumina sequencing by performing end repair, A-tailing, and adapter ligation “on bead”; (f) performing a final PCR step to add the required Illumina sequences for high-throughput sequencing, wherein the standard Illumina read 1 primer will anneal and read the cell barcode and UMI, while a custom read 2 primer, annealing to the end of the transposon, reads into the genome; and/or (g) analyzing the Illumina sequencing reads to collect both the location of each SRT insertion as well as the cell barcode corresponding to its cell of origin.
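By way of illustration only, the read analysis of step (g) might be sketched as follows in Python. The sketch assumes read 1 begins with the cell barcode followed immediately by the UMI, and that read 2 has already been aligned to give an insertion site; the 16-nucleotide barcode and 10-nucleotide UMI lengths, and the function names `split_read1` and `collect_insertions`, are hypothetical and would depend on the single-cell platform chosen.

```python
from collections import defaultdict


def split_read1(read1, barcode_len=16, umi_len=10):
    """Split a read 1 sequence into (cell barcode, UMI).

    The default lengths are assumptions for illustration, not values
    taken from the disclosure.
    """
    if len(read1) < barcode_len + umi_len:
        raise ValueError("read 1 is shorter than barcode + UMI")
    barcode = read1[:barcode_len]
    umi = read1[barcode_len:barcode_len + umi_len]
    return barcode, umi


def collect_insertions(read_pairs, barcode_len=16, umi_len=10):
    """Group distinct UMIs by (cell barcode, insertion site).

    read_pairs: iterable of (read1_sequence, site), where site is a
    hashable description of the aligned read 2, e.g. (chrom, pos, strand).
    """
    support = defaultdict(set)
    for read1, site in read_pairs:
        barcode, umi = split_read1(read1, barcode_len, umi_len)
        support[(barcode, site)].add(umi)
    return support


if __name__ == "__main__":
    site = ("chr1", 1000, "+")
    pairs = [
        ("A" * 16 + "C" * 10 + "TT", site),
        ("A" * 16 + "G" * 10 + "TT", site),
        ("A" * 16 + "C" * 10 + "TT", site),  # PCR duplicate of the first
    ]
    for key, umis in collect_insertions(pairs).items():
        print(key, len(umis))
```

Counting distinct UMIs rather than reads keeps PCR duplicates from inflating the support for an insertion.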

Yet another aspect of the present disclosure provides for a computational method for ascribing SRT insertions to specific cell types composed of the following steps: (i) identifying cell types from the scRNA-seq library from claim 37, step (iii) using known methods of scRNA-seq analysis (e.g., Seurat, scanpy); (ii) isolating the sets of cell barcodes comprising each cell type; (iii) using the cell type-specific barcodes to filter the barcoded single cell SRT insertions isolated in claim 37, step (iv); and (iv) optionally, determining cell type-specific locations of transient or stable GAP-DNA interactions from the aggregated barcoded SRT location data.
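As a non-limiting sketch, steps (ii) and (iii) of this computational method could be expressed in Python as below. It assumes the cell-type labels from step (i) are already available as a mapping from cell barcode to cell type; the function name `insertions_by_cell_type` and the data layout are hypothetical.

```python
def insertions_by_cell_type(cell_type_of_barcode, insertions):
    """Assign barcoded SRT insertions to cell types.

    cell_type_of_barcode: dict mapping cell barcode -> cell type label
        (assumed to come from a prior scRNA-seq clustering step).
    insertions: iterable of (cell barcode, site) tuples.
    Insertions whose barcode has no cell-type label are dropped.
    """
    grouped = {}
    for barcode, site in insertions:
        cell_type = cell_type_of_barcode.get(barcode)
        if cell_type is None:
            continue
        grouped.setdefault(cell_type, []).append(site)
    return grouped


if __name__ == "__main__":
    labels = {"AAAC": "HCT-116", "CCCG": "K562"}
    calls = [
        ("AAAC", ("chr1", 100)),
        ("CCCG", ("chr2", 200)),
        ("TTTT", ("chr3", 300)),  # barcode absent from the transcriptome data
    ]
    print(insertions_by_cell_type(labels, calls))
```

The per-cell-type lists can then be aggregated for the optional peak analysis of step (iv).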

In some embodiments, the single-cell RNA-seq methodology used includes, but is not restricted to, the following platforms: 10× Genomics Chromium, Drop-seq, Fluidigm, InDrop, MARS-seq, SCI-seq, SPLiT-seq, and Microwell-seq.

In some embodiments, the transposon system used includes, but is not restricted to, the following: DDE transposons (e.g., piggyBac transposons, Sleeping Beauty transposons), rolling circle/Y2 transposons (e.g., helitron transposons), or TP-retrotransposons (e.g., LINE-1); and the respective transposase.

Yet another aspect of the present disclosure provides for a computational method for filtering out intermolecular ligations in the analysis of single cell calling cards, comprising: (i) mapping the 5′ ends (without cellular barcode information) and the 3′ ends (with cellular barcode information but with imprecise information about the insertion site); and (ii) verifying that the location of the 3′ end maps near the 5′ end on the genome (e.g., less than 5 kb away).
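The proximity check of step (ii) might be sketched as follows in Python. The 5 kb default follows the example given above; the function name `is_intramolecular` and the representation of a mapped end as a (chromosome, position) pair are assumptions made for illustration.

```python
def is_intramolecular(five_prime, three_prime, max_distance=5000):
    """Return True if the 5' and 3' ends plausibly derive from one molecule.

    Each end is a (chromosome, position) tuple. A pair is kept only when
    both ends map to the same chromosome and lie less than max_distance
    base pairs apart.
    """
    chrom5, pos5 = five_prime
    chrom3, pos3 = three_prime
    return chrom5 == chrom3 and abs(pos5 - pos3) < max_distance


if __name__ == "__main__":
    print(is_intramolecular(("chr1", 10_000), ("chr1", 12_500)))  # nearby ends
    print(is_intramolecular(("chr1", 10_000), ("chr7", 10_100)))  # different chromosomes
```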

Yet another aspect of the present disclosure provides for a computational method for filtering out intermolecular ligations in the analysis of single cell calling cards by requiring that a given 5′-3′ pair in the circularization data is represented by at least 2 unique molecular identifiers (UMIs).
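A minimal Python sketch of this UMI-support filter is given below. It assumes each circularization record has been reduced to a (5′ end, 3′ end, UMI) triple; the function name `filter_by_umi_support` is hypothetical.

```python
from collections import defaultdict


def filter_by_umi_support(records, min_umis=2):
    """Keep 5'-3' pairs supported by at least min_umis distinct UMIs.

    records: iterable of (five_prime_end, three_prime_end, umi) tuples,
    where the ends are any hashable description of a mapped location.
    Returns the set of (five_prime_end, three_prime_end) pairs that pass.
    """
    umis_per_pair = defaultdict(set)
    for five_prime, three_prime, umi in records:
        umis_per_pair[(five_prime, three_prime)].add(umi)
    return {pair for pair, umis in umis_per_pair.items() if len(umis) >= min_umis}


if __name__ == "__main__":
    records = [
        (("chr1", 100), ("chr1", 900), "ACGT"),
        (("chr1", 100), ("chr1", 900), "TTAG"),
        (("chr2", 500), ("chr2", 700), "GGCA"),  # single-UMI pair, filtered out
    ]
    print(filter_by_umi_support(records))
```

Repeated observations of the same UMI count once, so a pair seen many times with a single UMI is still removed.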

In some embodiments, (a) if a self-cleaving ribozyme is present, the self-cleaving ribozyme degrades all mRNA produced by uninserted transposons (e.g., on donor plasmids); (b) if a ribosomal RNA (e.g., 5S, 18S) sequence is present, the ribosomal RNA sequence marks RNA transcribed from donor plasmids, and sequence contamination is removed using a bacterial or eukaryotic ribosomal RNA depletion kit; or (c) if a Woodchuck hepatitis posttranscriptional regulatory element (WPRE) is present, the WPRE stabilizes the mRNA molecule.

Yet another aspect of the present disclosure provides for a method of cellular lineage tracing, comprising: (i) introducing into one or more cells an undirected transposase, as described herein, that can integrate at many different locations in the genome; (ii) introducing into the same cells the SRT described herein; (iii) allowing the cells to divide and/or differentiate while they express the transposase so that SRTs are inserted into different genomic locations at different times along a cellular lineage, so that each SRT insertion serves as a lineage-specific barcode; (iv) mapping the location of SRT insertions and mRNA transcriptomes from single cells as described above; and (v) using single cell transcriptomes to ascertain, for each of a plurality of cells, information about the cell's identity, and ascertaining genealogical relationships between these cells using information about the number of shared SRT insertion events between cells.
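For illustration, the count of shared SRT insertion events between cells, on which the genealogical inference relies, could be computed with a short Python sketch such as the following. It assumes each cell's insertions are available as a set of sites keyed by cell barcode; the function name `shared_insertion_counts` is hypothetical, and how the counts are turned into a lineage tree is left open.

```python
from itertools import combinations


def shared_insertion_counts(insertions_per_cell):
    """Count SRT insertions shared by every pair of cells.

    insertions_per_cell: dict mapping cell barcode -> set of insertion
    sites. Returns a dict keyed by (cell_a, cell_b), with cell_a < cell_b,
    giving the number of insertion sites the two cells have in common.
    """
    counts = {}
    for cell_a, cell_b in combinations(sorted(insertions_per_cell), 2):
        shared = insertions_per_cell[cell_a] & insertions_per_cell[cell_b]
        counts[(cell_a, cell_b)] = len(shared)
    return counts


if __name__ == "__main__":
    cells = {
        "cell1": {("chr1", 100), ("chr2", 200), ("chr3", 300)},
        "cell2": {("chr1", 100), ("chr2", 200)},
        "cell3": {("chr1", 100)},
    }
    print(shared_insertion_counts(cells))
```

Cells that share more insertions are presumed to have diverged more recently, since insertions accumulate along a lineage.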

In some embodiments, the method provides for identifying novel cell types and reconstructing their genealogical context.

Yet another aspect of the present disclosure provides for a method of reading out transposon mutagenesis screens, comprising: (i) introducing into cells an SRT as a mutagen as described herein; and (ii) reading out the transposon insertions as described herein.

In some embodiments, the method is capable of measuring the transcriptome, genome, methylome, chromatin accessibility, cell-surface markers, or combinations thereof.

In some embodiments, genomic locations of inserted transposons can be mapped from either DNA or mRNA, wherein the use of mRNA enables both higher efficiency and compatibility with single-cell transcriptomics.

Other objects and features will be in part apparent and in part pointed out hereinafter.

DESCRIPTION OF THE DRAWINGS

Those of skill in the art will understand that the drawings, described below, are for illustrative purposes only. The drawings are not intended to limit the scope of the present teachings in any way.

FIG. 1A-FIG. 1B is a schematic depicting self-reporting calling cards. (A) By fusing a piggyBac transposase (PBase) to a transcription factor (TF), the TF is endowed with the ability to insert transposon DNA into the genome. (B) The transposon contains an EF1α promoter that drives transcription into the genome, which is eventually terminated by a cryptic or endogenous poly-adenylation signal sequence in the genome. Thus, the transposon reports its location via mRNA.

FIG. 2A-FIG. 2B is a schematic depicting how the ribozyme prevents unwanted donor recovery. (A) A self-cleaving ribozyme adjacent to the transposon TR causes transcripts generated from the donor plasmid to be degraded. Also, the lack of a poly-A tail prevents recovery by reverse transcription. (B) After insertion into the genome, transcripts are stable and are poly-A tailed or contain genomic poly-A sequences.

FIG. 3A-FIG. 3I is a schematic depicting the process of reading out self-reporting calling cards in a Drop-seq experiment.

FIG. 4 describes single cell calling cards accurately mapping bromodomain binding. Calling cards collected from a mixture of human HCT-116 and mouse N2a cells accurately map Brd4 binding. The top four panels show HCT-116 calling cards and insertion density and N2a calling cards and insertion density. Below that are Brd4 binding and H3K27ac as determined by ChIP-seq in bulk.

FIG. 5A-FIG. 5E is a series of schematics and graphs showing that self-reporting transposons (SRTs) are mapped more efficiently from RNA than from DNA and, when directed by SP1-PBase, identify SP1 binding sites. (A) Schematic of a self-reporting piggyBac transposon with puromycin marker (PB-SRT-Puro) and undirected (PBase) and SP1-directed (SP1-PBase) piggyBac transposases. SRTs are constructed by removing the polyadenylation signal sequence between the end of the marker gene and the 5′ terminal repeat (TR). A self-cleaving ribozyme (Rz) on the delivery vector, downstream of the SRT, prevents recovery of plasmid transposons. (B) SRTs are mapped by reverse transcribing RNA with a poly(T) primer followed by a series of nested PCRs and tagmentation. This final library is enriched for the junction between the transposon and the genome. (C) RNA-based recovery of SP1-directed SRTs in HCT-116 cells is more efficient than DNA-based recovery. The RNA protocol recovers 80% of the same insertions as the DNA protocol and recovers twice as many insertions overall. (D) The distribution of insertions with respect to gene annotation is identical between transposons recovered by DNA and by RNA. (E) Insertions deposited by SP1-PBase show pronounced and specific clustering at SP1 ChIP-seq peaks over insertions left by undirected PBase. In the calling card track, each circle represents an independent insertion. Genomic position is on the x-axis and the number of reads supporting that insertion is on the y-axis on a log₁₀-transformed scale. The density tracks show the local density of insertions in each experiment, normalized for library size.

FIG. 6A-FIG. 6F is a series of graphs, scatter plots, and a heat map showing undirected piggyBac (PBase) insertions mark BRD4-bound super-enhancers. (A) Undirected PBase insertions are distributed non-randomly, with increased density overlapping BRD4-bound chromatin and H3K27 acetylated histones. Also shown are BRD4-bound super-enhancers (SEs). (B) PBase peak calls are highly replicable, with biological replicates showing high concordance of normalized insertions at peaks. (C) PBase peaks show central enrichment for BRD4 ChIP-seq signal. These findings are statistically significant when compared to a genome-wide permutation of PBase peaks (p<10⁻⁹, KS test). (D) PBase peaks are centrally enriched for the histone modifications H3K27ac and H3K4me1, marks associated with enhancers. These same peaks show mild depletion for H3K9me and H3K27me, marks canonically associated with repressed chromatin. (E) Receiver-operator characteristic curve for SE detection using PBase insertions. (F) Precision-recall curve for SE detection using PBase insertions. IPM: insertions per million mapped insertions; AUROC: area under receiver-operator curve; AUPRC: area under precision-recall curve; KS: Kolmogorov-Smirnov; FC: fold change.
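The IPM unit used in these figures (insertions per million mapped insertions) can be illustrated with a short Python sketch. The function name `insertions_per_million` and the example counts are hypothetical and are not data from the figures.

```python
def insertions_per_million(region_insertions, total_mapped_insertions):
    """Normalize an insertion count to insertions per million (IPM).

    region_insertions: number of insertions falling in a peak or window.
    total_mapped_insertions: total mapped insertions in the library.
    """
    if total_mapped_insertions <= 0:
        raise ValueError("library must contain at least one mapped insertion")
    return region_insertions * 1_000_000 / total_mapped_insertions


if __name__ == "__main__":
    # Hypothetical example: 50 insertions in a peak, 200,000 in the library.
    print(insertions_per_million(50, 200_000))
```

Normalizing in this way allows peaks to be compared between replicates or experiments with different library sizes.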

FIG. 7A-FIG. 7F is a series of schematics, scatter plots, and graphs showing that single cell calling cards (scCC) map BRD4 and SP1 binding in single cells. (A) Schematic of the scCC library preparation strategy from scRNA-seq libraries. Self-reporting transcripts are amplified using biotinylated primers and circularized, which brings the cell barcode and unique molecular index (UMI) in close proximity to the transposon-genome junction. Circularized molecules are sheared, captured with streptavidin, and Illumina adapters are ligated. Custom sequencing yields the cell barcode and UMI with read 1 and the genomic insertion site with read 2. (B) Barnyard plot of HCT-116 and N2a cells transfected with SRTs shows clean segregation of cell types. Most cells were assigned either human insertions or mouse insertions, with a minority (7.8%) containing insertions from both species. (C) Human HCT-116 and K562 cells were transfected with PB-SRT-Puro and HyPBase and subsequently subjected to scRNA-seq. Two clear cell types emerge, revealing each constituent cell population. (D) scCC deconvolves HyPBase insertions from HCT-116 and K562 cells, identifying shared and specific BRD4 binding sites. (E) scCC on HCT-116 cells transfected with SP1-HyPBase identifies SP1 binding sites. (F) SP1-HyPBase peaks from scCC data show strong central enrichment for SP1 ChIP-seq signal.

FIG. 8 is a schematic of PB-SRT-tdTomato, an SRT compatible with in vivo experiments. The pre-transposition tdTomato transcript (left) is degraded by the downstream ribozyme (Rz), leading to low fluorescence intensity. After transposition into the genome, the self-reporting transcript is stabilized and results in a bright signal.

FIG. 9A-FIG. 9D is a series of schemes and graphs showing properties of self-reporting transposons. (A) Analysis of bulk RNA calling card libraries prepared from HEK293T cells transfected with PB-SRT-tdTomato with and without HyPBase transposase. The transposase is required for efficient and complex library generation. (B) Technical replication of bulk RNA calling cards from HCT-116 cells transfected with PB-SRT-Puro and SP1-PBase. Over 80% of insertions in each trial were shared between both replicates. (C) No significant differences were observed between DNA- and RNA-based recovery of SP1-directed insertions with respect to chromatin state in HCT-116 cells. (D) The self-cleaving ribozyme eliminates recovery of un-excised transposons when calling card libraries are prepared from RNA but not DNA.

FIG. 10A-FIG. 10B is a series of graphs showing piggyBac, SP1-piggyBac fusions, and Sleeping Beauty display different local transposition rates depending on chromatin state. (A) Chromatin state analysis on local rates of transposition. Undirected and SP1-directed piggyBac transposases show different preferences for chromatin states. Undirected piggyBac favors promoters and enhancers, while SP1-piggyBac fusions show marked preference for promoters. Sleeping Beauty shows uniform distribution of insertions across all chromatin states. (B) Same data as (A) but with different x-axes for each graph. IPM: insertions per million mapped insertions; kb: kilobase.

FIG. 11A-FIG. 11E is a series of graphs showing SP1 fused to piggyBac redirects insertions to SP1 binding sites. (A) SP1 peaks show high reproducibility between biological replicates. Each circle represents a peak; x and y coordinates represent normalized insertions in each biological replicate. (B) Mean SP1 ChIP-seq profile across all SP1 peaks shows strong central enrichment. (C) Heatmap of SP1 ChIP-seq signal across all SP1 peaks, expressed as log₂(FC) over the input control. (D) SP1-PBase shows enrichment of insertions to transcription start sites (TSSs), CpG islands, and unmethylated CpGs, all known biological targets of SP1. Each enrichment was statistically significant at p<10⁻⁹ (G test of independence). (E) Motif discovery performed on SP1 peaks shows good concordance with an orthogonally-derived SP1 motif. IPM: insertions per million mapped insertions; FC: fold change.

FIG. 12A-FIG. 12F is a series of graphs showing SP1 fused to hyperactive piggyBac (SP1-HyPBase) also redirects insertions to SP1 binding sites. (A) SP1-HyPBase, like SP1-PBase, can also be used to identify SP1 binding sites. (B) Insertions at SP1-HyPBase-derived peaks show high reproducibility between biological replicates. (C) Mean SP1 ChIP-seq profile at peaks shows strong central enrichment. (D) Heatmap of SP1 ChIP-seq signal across all peaks, expressed as log₂(FC) from the input control. (E) SP1-HyPBase redirects insertions to TSSs, CpG islands, and unmethylated CpGs (p<10⁻⁹, G test of independence). (F) Motif analysis of SP1-HyPBase peaks identifies the SP1 motif. IPM: insertions per million mapped insertions; FC: fold change.

FIG. 13A-FIG. 13F is a series of graphs showing undirected hyperactive piggyBac (HyPBase) insertions also mark BRD4-bound super-enhancers. (A) Undirected HyPBase, like PBase, shows non-uniform densities of insertions in BRD4-bound regions. (B) Densities of insertions are reproducible at HyPBase peaks. (C) Mean BRD4 ChIP-seq profile at HyPBase peaks compared to randomly chosen peaks. The BRD4 enrichment is significant with p<10⁻⁹ (KS test). (D) Undirected HyPBase peaks are strongly correlated with H3K27ac and H3K4me1 and mildly anti-correlated with H3K9me and H3K27me3, consistent with these regions being enhancers. (E) Receiver-operator characteristic curve for detecting BRD4-bound SEs with undirected HyPBase peaks. (F) Precision-recall curve for detecting SEs with undirected HyPBase peaks. KS: Kolmogorov-Smirnov; FC: fold change.

FIG. 14A-FIG. 14D is a series of graphs showing that downsampling PBase and HyPBase insertions affects sensitivity to BRD4-bound super-enhancers. (A) Downsampling analysis of BRD4-bound SE detection by PBase insertions at various p-value thresholds. (B) Downsampling analysis applied to HyPBase insertions. (C) Linear interpolation applied to (A) to predict SE sensitivity across a range of insertions. (D) Linear interpolation applied to (B).

FIG. 15 is a graph showing examples of BRD4-bound super-enhancers identified by PBase and HyPBase calling cards. Three different loci exhibit non-uniform densities of piggyBac insertions. These densities correlate well with BRD4 and H3K27ac ChIP-seq data. Density tracks are shown before and after smoothing. Sleeping Beauty does not show the same preference for BRD4-bound regions as piggyBac but instead appears uniformly distributed.

FIG. 16A-FIG. 16D is a series of plots showing that filtering single cell SRTs reduces intermolecular artifacts. (A) Barnyard plot from scRNA-seq of HCT-116 and N2a cells shows clean resolution of cell types. Cells were assigned as human or mouse if at least 80% of transcripts in each cell mapped to hg38 or mm10, respectively, or as a multiplet otherwise. 3.2% of cells were classified as multiplets. (B) Barnyard plot from scCC of HCT-116 and N2a cells without filtering. 25.1% of cells were called as multiplets. (C) Distribution of species purity from unfiltered scCC data. The x-axis is the proportion of transcripts mapping to the human or mouse genomes. (D) Distribution of species purity after filtering scCC data.
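The species assignment rule described for panel (A) might be sketched in Python as follows. The 80% threshold is taken from the description above; the function name `classify_species`, the "unassigned" label for cells with no mapped transcripts, and the example counts are assumptions for illustration.

```python
def classify_species(human_count, mouse_count, threshold=0.8):
    """Classify a cell as human, mouse, or multiplet.

    human_count / mouse_count: transcripts mapping to hg38 / mm10.
    A cell is assigned to a species when at least `threshold` of its
    transcripts map to that genome, and is a multiplet otherwise.
    """
    total = human_count + mouse_count
    if total == 0:
        return "unassigned"  # assumption: no mapped transcripts to judge by
    if human_count / total >= threshold:
        return "human"
    if mouse_count / total >= threshold:
        return "mouse"
    return "multiplet"


if __name__ == "__main__":
    for counts in [(950, 50), (30, 970), (500, 500)]:
        print(counts, classify_species(*counts))
```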

FIG. 17A-FIG. 17I is a series of graphs showing validation and performance of in vitro single cell calling cards. (A) Expression of three marker genes (AKAP12, PRAME, XIST) identifies HCT-116 (n=12,891) and K562 (n=11,912) cells from scRNA-seq libraries. (B) Distributions of genes per cell by cell type. (C) Distributions of transcripts per cell by cell type. The numbers of genes and transcripts detected per cell were comparable between HCT-116 and K562 cells. (D) Distributions of recovered HyPBase insertions in HCT-116 and K562 cells. (E) Mean BRD4 ChIP-seq signal at HyPBase peaks in HCT-116 cells compared to randomly permuted peaks (p<10⁻⁹, KS test). (F) Mean BRD4 ChIP-seq signal at HyPBase peaks in K562 cells compared to randomly permuted peaks (p<10⁻⁹, KS test). (G) Distributions of recovered insertions in HCT-116 cells transfected with PB-SRT-Puro and either HyPBase or SP1-HyPBase. (H) Reproducibility of normalized insertions deposited by HyPBase and recovered by scCC at BRD4 binding sites in HCT-116 cells. (I) Reproducibility of normalized insertions deposited by SP1-HyPBase and recovered by scCC at SP1 binding sites in HCT-116 cells. KS: Kolmogorov-Smirnov.

FIG. 18 is a schematic depicting the calling cards method adapted for use with the L1 retrotransposon with DNA recovery. By fusing a retro-transposase to a transcription factor (TF), the TF is endowed with the ability to insert retrotransposons into the genome. In this embodiment, the TF is fused to ORF2 of the L1 transposon. ORF2 then cuts genomic DNA at TF binding sites and copies the RNA transposon via its reverse transcriptase activity to insert a DNA copy of the transposon in the genome. Transposon locations can be mapped by performing inverse PCR followed by Illumina sequencing.

FIG. 19 is a genome-wide view of LINE-1 insertions that were either directed by SP1 or undirected. These insertions were mapped using the protocol described in FIG. 18.

FIG. 20 (top panel) demonstrates that as many insertions were recovered from the Sp1-retrotransposase fusion as from the unfused retrotransposase, indicating that the fusion does not significantly impair retrotransposase activity. (Bottom panel) The Sp1-retrotransposase fusion deposits significantly more transposons into promoters, 5′ UTRs and CpG islands, consistent with Sp1's known binding preferences for these regions of the genome. These data demonstrate that retrotransposon insertions are significantly enriched near Sp1 binding sites.

FIG. 21 is a table and graph demonstrating that the Sp1 transcription factor can redirect insertions of the L1 transposon, as Sp1-directed insertions are enriched near Sp1 motifs.

FIG. 22 provides an example of a known Sp1 binding site, demonstrating the concordance of ChIP-seq signal, L1 calling cards directed by Sp1, and piggyBac calling cards directed by Sp1.

FIG. 23 demonstrates that the L1 transposon can be controlled by a small molecule shield when a degradation domain (DD) is fused to ORF1.

FIG. 24 is a recovery protocol for a self-reporting L1 transposon where unwanted transposon DNA is removed by Cas9 cutting. This is achieved by performing the following steps:

(a) capturing the biotinylated PCR product on streptavidin-coated magnetic beads;

(b) optionally, tailing the ends of the PCR product with a dideoxynucleotide (ddNTP);

(c) incubating the PCR products in vitro with Cas9 and guide RNAs (gRNAs) to specifically cut the unwanted transposon sequence;

(d) end repairing, A-tailing, and ligating Y-adapters to the cut PCR products “on bead”;

(e) amplifying by PCR with a primer specific for the Y-adapter and a primer specific to the universal sequence; and

(f) purifying the resulting PCR product and proceeding with step (iv).

FIG. 25 is a graph showing percent similarity of inferred trees to the original tree as a function of transposition rate. Green: parsimony-based reconstruction; blue: UPGMA reconstruction. Data represent the mean of 1000 replicates.

FIG. 26A-FIG. 26B is a series of graphs demonstrating that the rate of transposon insertion by a transposase-degradation domain fusion can be controlled by titrating the amount of inducer present. FIG. 26A depicts a FACS plot of a fluorescent reporter of transcription with and without transposase, and FIG. 26B demonstrates that the rate of transposition (which is proportional to fluorescence) can be tuned by adding different amounts of a chemical inducer.

DETAILED DESCRIPTION OF THE INVENTION

The present disclosure is based, at least in part, on the discovery of a first-of-its-kind transposon system that reports its precise location (i.e., a self-reporting transposon (SRT)). Although other (e.g., gene trap) transposons have transcribed through their long terminal repeats (LTRs), they have never reported the LTR-genome junction.

Applications can include mapping the locations of SRTs from thousands of single cells in parallel, while simultaneously measuring mRNA abundance from the same single cells; analyzing TF binding in a small number of cells in bulk, without single cell resolution, which is also useful for some applications; lineage tracing; or as an improved readout for transposon mutagenesis screens.

In situ measurements of transcription factor (TF) binding are confounded by cellular heterogeneity and represent averaged profiles in complex tissues. Single cell RNA-seq (scRNA-seq) is capable of resolving different cell types based on gene expression profiles, but no technology exists to directly link specific cell types to the binding pattern of TFs in those cell types.

Described herein are self-reporting transposons (SRTs) and their use in single cell calling cards (scCC), a novel assay for simultaneously capturing gene expression profiles and mapping TF binding sites in single cells.

First, it was shown how the genomic locations of SRTs can be recovered from mRNA. Next, it was demonstrated that SRTs deposited by the piggyBac transposase can be used to map the genome-wide localization of the TFs SP1, through a direct fusion of the two proteins, and BRD4, through its native affinity for piggyBac. The scCC method is then presented, which maps SRTs from scRNA-seq libraries, thus enabling concomitant identification of cell types and TF binding sites in those same cells. Also shown is the recovery of cell type-specific BRD4 and SP1 binding sites from cultured cells. Finally, Brd4 binding sites were mapped in the mouse cortex at single cell resolution, thus establishing a new technique for studying TF biology in situ.

The present disclosure provides for methods for the mapping of transposon insertions from single cells (providing for single cell analysis). As described herein, a new transposon reporter system is used to report sites of transcription factor (TF)-DNA interaction. The present technology provides for methods to identify TF-DNA interactions in vitro and in vivo. This technology not only maps the site of interaction but can also quantify the abundance of the transcript (i.e., a length of RNA or DNA that has been transcribed respectively from a DNA or RNA template).

As described herein, the 3′ region flanking the terminal repeat is transcribed and the transcript is terminated by a cryptic poly-A signal sequence (e.g., AAUAAA sequence (SEQ ID NO: 12)) or transcriptional termination sequence in the genome, allowing for quantitative single cell analysis, cell fate mapping, or transposon mutagenesis screens, among other applications.

The ability to chronicle transcription-factor binding events throughout the development of an organism can facilitate mapping of transcriptional networks that control cell-fate decisions. This method permanently records protein-DNA interactions in mammalian cells. The transcription factors are endowed with the ability to deposit a transposon into the genome near to where they bind. The transposon becomes a “calling card” that the transcription factor leaves behind to record its visit to the genome. The locations of the calling cards can be determined by massively parallel DNA sequencing.

Previously developed Transposon Calling Cards allow transient molecular interactions to be captured non-destructively, during a controlled window in time, and then read out at a later point in time. The new self-reporting Calling Cards are transposons that have been engineered to contain a strong promoter (e.g., Ef1 alpha) that drives the transcription of a reporter gene that has no poly-A termination signal. When a self-reporting transposon is inserted into the genome, the reporter gene is transcribed, and transcription continues through the piggyBac terminal repeat and into the neighboring genomic region until a cryptic poly-A termination signal is encountered. The present disclosure provides for methods of mapping transposon insertions from single cells, in conjunction with Drop-Seq or 10× Chromium technologies. These technologies analyze RNA expression genome-wide in thousands of individual cells at once.

Self-Reporting Transposon (SRT) Construct

The self-reporting transposon construct as described herein comprises a transposon that can report its location in the cellular RNA fraction. The SRT construct can be used in conjunction with a transposase or a fusion of a transposase with a genome-associated protein (GAP; e.g., a transcription factor, a general transcriptional mediator).

Reporter

As described herein, the SRT construct can comprise a reporter. The reporter can be any reporter known in the art. For example, the reporter can be a screenable or selectable reporter gene. As another example, the reporter can be a gene capable of encoding a fluorescent protein (e.g., green fluorescent protein, tdTomato, eGFP, eCFP).

As another example, the reporter can be a gene capable of use as a selectable marker by conferring resistance to a chemical agent that kills eukaryotic or prokaryotic cells (e.g., puromycin N-acetyl-transferase, providing resistance to puromycin; either of two aminoglycoside 3′ phosphotransferase genes encoded by Tn5 and Tn601 (i.e., the neo gene), providing resistance to G418; or hygromycin phosphotransferase, providing resistance to hygromycin).

As another example, the reporter can be an enzyme capable of converting a chemical substrate into a colorimetric, luminescent, or fluorescent reporter (e.g., beta-galactosidase or beta-lactamase, cleaving X-gal and GeneBLAzer, respectively).

As described herein, a reporter gene having no poly-A termination signal enables RNA polymerase II (Pol II) to transcribe the reporter gene contained in the transposon and continue through the terminal repeat (TR) or end of the transposon insertion element into the flanking genomic sequence.

As another example, the reporter can be an interrupted reporter gene (e.g., itdTomato).

Promoter

As described herein, a promoter capable of initiating transcription can be incorporated into the SRT construct, for example in the insertion element (or donor). Any promoter capable of initiating transcription of RNA or a reporter gene can be used. Preferably a constitutive promoter or a strong promoter, such as EF1a, CMV, or CAG can be used. In some embodiments, a T7 promoter can be used. Other promoters that can be used include a ubiquitous promoter (e.g., Ubq-C) or an inducible promoter (e.g., a tet-inducible promoter). Cell-type specific promoters can be used. For example, a neuron specific promoter (e.g., Syn1) can be used in the constructs described herein.

Insertion Element (Donor)

As described herein, the transposon system comprises an insertion element capable of being inserted into a genome. An insertion element (also known as a donor, an IS, an insertion sequence element, or an IS element) is a short DNA sequence that acts as a simple transposable element (TE). The insertion element can comprise a reporter and/or a promoter. The insertion element can provide a sequence that can be read out and reported.

Transposons and Transposases

As described herein, the self-reporting transposon is inserted into a cellular genome by the corresponding transposase protein, which can be delivered to the cell separately from the transposon, or which can be encoded in the transposon DNA. Together, the transposon and transposase comprise a transposon system. The transposon system can be any transposon system known in the art (see e.g., Munoz-Lopez et al. 2010 Curr Genomics. 11(2) 115-128). A transposase is an enzyme that binds to the end of a transposon and catalyzes its movement to another part of the genome by a cut and paste mechanism or a replicative transposition mechanism. More specifically, the transposase recognizes the terminal repeats to excise the transposon DNA, which is then inserted into a new genomic location by cut and paste or copy and paste mobilization. DNA transposons can inactivate or alter the expression of genes by insertion within introns, exons, or regulatory regions. Here, however, the DNA transposons are used to express an RNA transcript identifying the location of a TR. For example, a transposon can be a DDE transposon (e.g., piggyBac transposon, hyperactive variant of piggyBac transposon, Sleeping Beauty transposon, Ty5 transposon), a rolling circle/Y2 transposon (e.g., Helitrons, such as Helraiser), or a TP-retrotransposon (e.g., LINE-1 retrotransposon).

Genome Associated Protein

As described herein, a genome-associated protein:transposase fusion can be used to insert SRT constructs into cellular genomes. Any genome-associated protein known in the art can be used.

As described herein, the genome-associated protein can be a trans-acting factor or element. Trans-acting elements can be a transcription factor (TF) or other DNA-binding protein which recognizes and binds to specific sequences in a cis-acting element to initiate, enhance, or suppress transcription. A transcription factor can regulate multiple genes or it may work in a combinatorial or complex manner to bind to the cis-regulatory elements at multiple transcription factor binding sites to generate a huge repertoire of unique and precise control patterns. It is estimated that the human genome encodes approximately 1800 transcription factors (Venter et al., 2001).

As described herein, a TF-transposase fusion can be used to deliver the SRT construct. For example, any TF known in the art can be used to deliver the SRT construct (e.g., Foxa1, Hnf4a, Sp1, Tbp, Hb9, Jun, Olig2, Ngn2, Creb, Fos, Egr1). As another example, the genome-associated protein can be any protein that interacts indirectly with genomic DNA, such as Brd4, Med1, Chd4, Chd2, Bap1, all of which bind to other genome-associated proteins.

Transcription factors (TFs) (or sequence-specific DNA-binding factors) are proteins that control the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. The function of TFs is to regulate genes (turning them on and off) in order to make sure that they are expressed in the right cell at the right time and in the right amount throughout the life of the cell and the organism. Groups of TFs function in a coordinated fashion to direct cell division, cell growth, and cell death throughout life; cell migration and organization (body plan) during embryonic development; and intermittently in response to signals from outside the cell, such as a hormone.

A defining feature of TFs is that they contain at least one DNA-binding domain (DBD), which attaches to a specific sequence of DNA adjacent to the genes that they regulate. TFs are grouped into classes based on their DBDs.

Other genome-associated proteins such as coactivators, chromatin remodelers, histone acetyltransferases, histone deacetylases, kinases, and methylases are also essential to gene regulation, but lack DNA-binding domains, and therefore are not considered TFs, but these can still be used to direct the insertion of SRTs into the genome.

Regulation of Transposase Activity

As described herein, the genome-associated protein-transposase fusion can be further linked to a destabilized domain (DD) to achieve temporal control over transposition. Destabilized domains are ligand binding proteins that have been mutated so that they are unstable and the DD, as well as any protein fused to the DD, is degraded by the cellular machinery. However, in the presence of the corresponding ligand molecule, termed a shield, the DD domains are stabilized. There are several proteins that can act as DDs, such as mutants of FKBP12 (with rapamycin as the shield), E. coli dihydrofolate reductase (DHFR, with trimethoprim as the shield), and the estrogen receptor protein (ERT2, with tamoxifen as the shield).

Delivery of SRTs and Transposases

As described herein, the SRT construct can be delivered to the cell via known methods of gene delivery. The SRT construct can be delivered to a cell via a gene delivery method such as electroporation, lipofection, a viral vector (e.g., lentivirus, adenovirus, herpes simplex virus, adeno-associated virus), micro-injection, sonoporation, or magnetofection. The transposase can be delivered to the cell encoded on DNA or RNA, or directly as a protein, using known methods of DNA delivery (as listed above) or known methods for RNA or protein delivery, such as electroporation, lipofection, a viral vector (e.g., integration-defective lentivirus), micro-injection, sonoporation, or magnetofection.

Mapping Locations of SRTs

The present disclosure provides methods for mapping the locations of SRTs from cellular RNA, which provides advantages over other methods of transposon mapping. The method, which can be used to analyze cells with SRTs transposed into their genomic DNA, can comprise the following steps:

(i) harvesting total RNA in bulk from the cells;

(ii) reverse transcribing mRNA into cDNA using a poly-T primer tailed with a universal sequence (e.g., the SMART primer);

(iii) PCR amplifying the cDNA using a primer specific either for the transposon end or the reporter gene and a primer (optionally biotinylated) specific to the universal sequence;

(iv) tagmenting the PCR product (e.g., using a Nextera kit);

(v) amplifying using the transposon end primer and tagmentation primers suitable for amplifying the PCR product encoding the junction between the inserted transposon and the genome (e.g., a Nextera primer); and

(vi) sequencing the tagged DNA fragments by employing 2nd or 3rd generation sequencing technology (e.g., Illumina or PacBio) using sequencing primers that are designed so that the transposon-genome junction is sequenced.
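By way of non-limiting illustration, once reads spanning the transposon-genome junction have been obtained in step (vi), the genomic flank can be separated from the transposon end computationally before alignment. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the transposon-end sequence passed in by the caller is a placeholder rather than a sequence from the Sequence Listing, and the optional TTAA check reflects the known target-site preference of piggyBac and would not apply to other transposon systems.

```python
def genomic_flank(read, tr_end, insertion_motif="TTAA"):
    """Return the genomic sequence that follows the transposon end, or None.

    read            -- sequencing read expected to begin inside the transposon
    tr_end          -- transposon-end sequence the read must start with
                       (supplied by the caller; placeholder in the examples)
    insertion_motif -- motif expected at the start of the flank (piggyBac
                       integrates at TTAA sites); pass "" to skip the check
    """
    read = read.upper()
    tr_end = tr_end.upper()
    if not read.startswith(tr_end):
        return None  # read does not span the transposon-genome junction
    flank = read[len(tr_end):]
    if not flank.startswith(insertion_motif):
        return None  # junction lacks the expected insertion motif
    return flank
```

The returned flank would then be aligned to the reference genome with any standard aligner; the coordinate of the first aligned base gives the insertion site.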

Simultaneous Measurement of SRT Locations and mRNA Transcriptomes from Single Cells

The present disclosure provides for methods of mapping the locations of SRTs and simultaneously measuring mRNA abundance from thousands of single cells in parallel (using a protocol combined from 10× Genomics or Drop-Seq single cell methodology and an Illumina paired end kit) (see e.g., Example 1). The method can comprise the following steps:

(i) Converting the mRNA from single cells into cDNA that is labeled at its 3′ end with a cell barcode, a unique molecular index (UMI), and a universal priming sequence using known methods for single-cell RNA-seq such as 10× Chromium, Drop-Seq, or InDrop. All of the mRNA molecules from a single cell will be tagged with the same cell barcode, but different UMIs.

(ii) Separating the pooled cDNA from all cells analyzed in the experiment into two fractions.

(iii) Recovering the transcriptomes of the single cells by completing the single cell RNA-seq protocol chosen in step (i) using one of the cDNA fractions.

(iv) Mapping the genomic locations of SRTs and assigning the SRTs to cell barcodes by circularizing SRT cDNA to physically bring the cell barcode in apposition to the insertion site and then performing Illumina sequencing. This is achieved by:

(a) amplifying SRT cDNA by performing PCR with primers that bind to the universal priming sequence next to the cell barcode and the transposon end. These primers are biotinylated and carry a 5′ phosphate group.

(b) optionally, removing unwanted transposon sequence from the PCR products by following the method of claim 32.

(c) diluting the PCR products of this amplification and performing a self-ligation reaction.

(d) shearing the self-ligated products and capturing the ligation junction and flanking sequences by pulling these fragments down with streptavidin-coated magnetic beads.

(e) preparing this fragment for Illumina sequencing by performing end repair, A-tailing, and adapter ligation “on bead”.

(f) performing a final PCR step to add the required Illumina sequences for high-throughput sequencing. The standard Illumina read 1 primer will anneal and read the cell barcode and UMI, while a custom read 2 primer, annealing to the end of the transposon, reads into the genome.

(g) analyzing the Illumina sequencing reads to collect both the location of each SRT insertion as well as the cell barcode corresponding to its cell of origin.
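By way of non-limiting illustration, the analysis in step (g) can be sketched as a grouping of aligned reads by cell barcode, with reads sharing a UMI collapsed so that each captured molecule is counted once. This is a minimal sketch, assuming each read has already been parsed into a (cell barcode, UMI, chromosome, position, strand) record; the record layout, the function name, and the `min_umis` cutoff are hypothetical and are not part of the disclosure.

```python
from collections import defaultdict


def insertions_by_cell(records, min_umis=1):
    """Assign SRT insertions to cells and count supporting UMIs.

    records  -- iterable of (cell_barcode, umi, chrom, pos, strand) tuples,
                one per aligned read
    min_umis -- minimum number of distinct UMIs required to keep an insertion
    Returns {cell_barcode: {(chrom, pos, strand): n_distinct_umis}}.
    """
    # Collect the distinct UMIs seen for every (cell, insertion site) pair.
    umis = defaultdict(set)
    for barcode, umi, chrom, pos, strand in records:
        umis[(barcode, (chrom, pos, strand))].add(umi)

    # Keep insertions with enough independent molecular support.
    per_cell = defaultdict(dict)
    for (barcode, site), umi_set in umis.items():
        if len(umi_set) >= min_umis:
            per_cell[barcode][site] = len(umi_set)
    return dict(per_cell)
```

The resulting per-cell insertion lists can be joined to the transcriptome-derived cell type labels from step (iii) through the shared cell barcodes.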

Computational

Described herein are computational methods to filter out intermolecular ligations and software to (i) identify single cell calling card insertions; (ii) simulate single cell calling card insertions across a monoclonal population; and (iii) analyze self-reporting Calling Cards.

The methods and algorithms of the invention may be enclosed in a controller or processor. Furthermore, methods and algorithms of the present invention can be embodied as a computer implemented method or methods for performing such computer-implemented method or methods, and can also be embodied in the form of a tangible or non-transitory computer readable storage medium containing a computer program or other machine-readable instructions (herein “computer program”), wherein when the computer program is loaded into a computer or other processor (herein “computer”) and/or is executed by the computer, the computer becomes an apparatus for practicing the method or methods. Storage media for containing such computer program include, for example, floppy disks and diskettes, compact disk (CD)-ROMs (whether or not writeable), DVD digital disks, RAM and ROM memories, computer hard drives and back-up drives, external hard drives, “thumb” drives, and any other storage medium readable by a computer. The method or methods can also be embodied in the form of a computer program, for example, whether stored in a storage medium or transmitted over a transmission medium such as electrical conductors, fiber optics or other light conductors, or by electromagnetic radiation, wherein when the computer program is loaded into a computer and/or is executed by the computer, the computer becomes an apparatus for practicing the method or methods. The method or methods may be implemented on a general purpose microprocessor or on a digital processor specifically configured to practice the process or processes. When a general-purpose microprocessor is employed, the computer program code configures the circuitry of the microprocessor to create specific logic circuit arrangements. Storage medium readable by a computer includes medium being readable by a computer per se or by another machine that reads the computer instructions for providing those instructions to a computer for controlling its operation. Such machines may include, for example, machines for reading the storage media mentioned above.

Kits

Also provided are kits. Such kits can include an agent or composition described herein and, in certain embodiments, instructions for administration. Such kits can facilitate performance of the methods described herein. When supplied as a kit, the different components of the composition can be packaged in separate containers and admixed immediately before use. Components include, but are not limited to, a composition comprising a GAP, a transcription factor, a transposon, a reporter, or a transposon base. Such packaging of the components separately can, if desired, be presented in a pack or dispenser device which may contain one or more unit dosage forms containing the composition. The pack may, for example, comprise metal or plastic foil such as a blister pack. Such packaging of the components separately can also, in certain instances, permit long-term storage without losing activity of the components.

Kits may also include reagents in separate containers such as, for example, sterile water or saline to be added to a lyophilized active component packaged separately. For example, sealed glass ampules may contain a lyophilized component and, in a separate ampule, sterile water or sterile saline, each of which has been packaged under a neutral non-reacting gas, such as nitrogen. Ampules may consist of any suitable material, such as glass, organic polymers, such as polycarbonate, polystyrene, ceramic, metal or any other material typically employed to hold reagents. Other examples of suitable containers include bottles that may be fabricated from similar substances as ampules, and envelopes that may consist of foil-lined interiors, such as aluminum or an alloy. Other containers include test tubes, vials, flasks, bottles, syringes, and the like. Containers may have a sterile access port, such as a bottle having a stopper that can be pierced by a hypodermic injection needle. Other containers may have two compartments that are separated by a readily removable membrane that upon removal permits the components to mix. Removable membranes may be glass, plastic, rubber, and the like.

In certain embodiments, kits can be supplied with instructional materials. Instructions may be printed on paper or other substrate, and/or may be supplied as an electronic-readable medium, such as a floppy disc, mini-CD-ROM, CD-ROM, DVD-ROM, Zip disc, videotape, audio tape, and the like. Detailed instructions may not be physically associated with the kit; instead, a user may be directed to an Internet web site specified by the manufacturer or distributor of the kit.

Compositions and methods described herein utilizing molecular biology protocols can be according to a variety of standard techniques known to the art (see, e.g., Sambrook and Russell (2006) Condensed Protocols from Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, ISBN-10: 0879697717; Ausubel et al. (2002) Short Protocols in Molecular Biology, 5th ed., Current Protocols, ISBN-10: 0471250929; Sambrook and Russell (2001) Molecular Cloning: A Laboratory Manual, 3d ed., Cold Spring Harbor Laboratory Press, ISBN-10: 0879695773; Elhai, J. and Wolk, C. P. 1988. Methods in Enzymology 167, 747-754; Studier (2005) Protein Expr Purif. 41(1), 207-234; Gellissen, ed. (2005) Production of Recombinant Proteins: Novel Microbial and Eukaryotic Expression Systems, Wiley-VCH, ISBN-10: 3527310363; Baneyx (2004) Protein Expression Technologies, Taylor & Francis, ISBN-10: 0954523253).

Definitions and methods described herein are provided to better define the present disclosure and to guide those of ordinary skill in the art in the practice of the present disclosure. Unless otherwise noted, terms are to be understood according to conventional usage by those of ordinary skill in the relevant art.

In some embodiments, numbers expressing quantities of ingredients, properties such as molecular weight, reaction conditions, and so forth, used to describe and claim certain embodiments of the present disclosure are to be understood as being modified in some instances by the term “about.” In some embodiments, the term “about” is used to indicate that a value includes the standard deviation of the mean for the device or method being employed to determine the value. In some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the present disclosure are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the present disclosure may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements. The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein.

In some embodiments, the terms “a” and “an” and “the” and similar references used in the context of describing a particular embodiment (especially in the context of certain of the following claims) can be construed to cover both the singular and the plural, unless specifically noted otherwise. In some embodiments, the term “or” as used herein, including the claims, is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive.

The terms “comprise,” “have” and “include” are open-ended linking verbs. Any forms or tenses of one or more of these verbs, such as “comprises,” “comprising,” “has,” “having,” “includes” and “including,” are also open-ended. For example, any method that “comprises,” “has” or “includes” one or more steps is not limited to possessing only those one or more steps and can also cover other unlisted steps. Similarly, any composition or device that “comprises,” “has” or “includes” one or more features is not limited to possessing only those one or more features and can cover other unlisted features.

All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the present disclosure and does not pose a limitation on the scope of the present disclosure otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the present disclosure.

Groupings of alternative elements or embodiments of the present disclosure disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.

All publications, patents, patent applications, and other references cited in this application are incorporated herein by reference in their entirety for all purposes to the same extent as if each individual publication, patent, patent application or other reference was specifically and individually indicated to be incorporated by reference in its entirety for all purposes. Citation of a reference herein shall not be construed as an admission that such is prior art to the present disclosure.

Having described the present disclosure in detail, it will be apparent that modifications, variations, and equivalent embodiments are possible without departing from the scope of the present disclosure defined in the appended claims. Furthermore, it should be appreciated that all examples in the present disclosure are provided as non-limiting examples.

EXAMPLES

The following non-limiting examples are provided to further illustrate the present disclosure. It should be appreciated by those of skill in the art that the techniques disclosed in the examples that follow represent approaches the inventors have found function well in the practice of the present disclosure, and thus can be considered to constitute examples of modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments that are disclosed and still obtain a like or similar result without departing from the spirit and scope of the present disclosure.

Example 1: Methods for the Mapping of Transposon Insertions from Single Cells

This example describes a technology to reliably measure TF binding from single cells. The inventors have previously developed transposon ‘Calling Cards’, a method whereby transcription factor (TF) binding is mapped by fusing the transcription factor to the transposase of a transposon, endowing the TF with the ability to deposit transposon DNA near to where it binds. Described here is a new type of transposon, a self-reporting transposon (SRT), that reports its location in the genome in RNA. Described herein are methods for mapping the locations of SRTs from thousands of single cells in parallel, while simultaneously measuring mRNA abundance from the same single cells. Using these methods, transcription factor binding and mRNA abundance can be mapped from thousands of single cells obtained from a heterogeneous mixture (e.g., brain tissue). Also described here are methods for analyzing TF binding in a small number of cells in bulk (e.g., 100-10000 cells), without single cell resolution, which is also useful for some applications. Finally, it is described how this methodology could be used for other applications such as lineage tracing or as an improved readout for transposon mutagenesis screens. These methods are described in detail herein, but are briefly summarized here to point out the features that make the presently disclosed methods new and different.

Self-Reporting Calling Cards

Self-reporting Calling Cards builds on the inventors' previous transposon Calling Card method, but represents a significant paradigm shift. Transposon Calling Cards allow transient molecular interactions to be captured non-destructively, during a controlled window in time, and then read out at a later point in time. This is achieved by fusing any TF to the piggyBac transposase, which bestows on the TF the ability to direct transposon insertion into the genome near to where it binds (see e.g., FIG. 1A). This transposon sequence then acts as a “Calling Card” that permanently tags a transient TF-DNA interaction. By sequencing the tags from the genomes of the cells at a later time (e.g., after reprogramming), the molecular events that occurred earlier can be read out. Self-reporting Calling Cards are transposons that have been engineered to contain a strong promoter (e.g., EF1α) that drives the transcription of a reporter gene that has no poly-A termination signal (see e.g., FIG. 1B). When a self-reporting transposon is inserted into the genome, the reporter gene is transcribed, and transcription continues through the piggyBac terminal repeat and into the neighboring genomic region until a cryptic poly-A termination signal is encountered. Since poly-A termination is governed largely by the hexamer AAUAAA (SEQ ID NO: 12), transcription terminates, on average, 4⁶ or 4096 bp into the genome. Because the EF1α promoter drives transcription in nearly all chromatin states that are capable of being bound by a TF, these transposons “report” their genomic positions via mRNA transcription (H3K9Me2/3-marked heterochromatin, for example, is likely to silence EF1α, but this type of chromatin is not bound by pioneer TFs). The locations of these self-reporting transposons can be read out in single cells by making minor modifications to existing high-throughput single cell RNA-Seq protocols such as Drop-Seq. Some transcripts from inserted transposons either (1) do not acquire a poly-A tail or (2) acquire a poly-A tail far from the insertion site and are primed off of genomic poly-A homopolymers in the transcript (there is evidence that this happens).
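The 4⁶ estimate above follows from treating the flanking genome as a uniform random sequence, in which any particular six-base motif occurs about once every 4,096 positions. The following is a minimal Python sketch under that simplifying assumption; the helper names `expected_spacing` and `first_cryptic_pas` are hypothetical and are not part of the disclosed methods. It scans the sequence downstream of a transposon end for the first AATAAA, the DNA-sense form of AAUAAA.

```python
PAS_MOTIF = "AATAAA"  # DNA-sense form of the AAUAAA poly-A signal


def expected_spacing(motif: str) -> int:
    """Mean spacing between motif occurrences in a uniform random sequence."""
    return 4 ** len(motif)


def first_cryptic_pas(downstream: str, motif: str = PAS_MOTIF) -> int:
    """Offset of the first motif downstream of the transposon end.

    Returns -1 when the motif is absent (hypothetical helper).
    """
    return downstream.upper().find(motif)


print(expected_spacing(PAS_MOTIF))             # 4096
print(first_cryptic_pas("GGCTTAACGAATAAAGT"))  # 9
```

Real genomes are not uniform in base composition, so the observed distance to the first cryptic signal can differ from this idealized figure.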

Although transposons have been engineered with promoters to initiate transcription, they either transcribe a reporter gene with a poly-A signal, or they transcribe an artificial exon with a splice donor site so that, after the transposon is inserted into the genome, the artificial exon is spliced onto nearby genomic exons. What is unique here is 1) the region immediately flanking the transposon terminal repeat is transcribed and reported in cellular mRNA and 2) the transcript is terminated by a cryptic poly-adenylation signal or picks up a poly-A stretch in the genome so that the transcript can be recovered by reverse transcription using a poly-T primer.

Use of a Hammerhead Ribozyme to Prevent Unwanted Donor Recovery

One problem that was frequently encountered when performing the Calling Card protocol was unwanted donor transposon recovery. When a calling card experiment was read out, the locations of inserted transposons were mapped by next-generation sequencing. However, unless special methods are used, both transposons inserted into the genome by the TF-transposase fusion as well as un-inserted donor transposons are recovered. These donor transposons are usually delivered to the cell on a plasmid via electroporation, or in an AAV virus via infection, although they can also be engineered to reside in the cellular genome. The method that works well for SRT recovery is to use a hammerhead ribozyme to prevent unwanted donor recovery. Described herein is how this works when a piggyBac donor transposon is delivered to the cell via plasmid electroporation, but the same principles apply when donor transposons are delivered via AAV or engineered into the cellular genome. The donor plasmid is designed so that a strong promoter inside of the piggyBac transposon drives expression of a reporter gene that has no poly-A termination signal, as previously shown in FIG. 1B. However, in the donor plasmid, a sequence encoding a hammerhead ribozyme is placed immediately downstream of the piggyBac terminal repeat (see e.g., FIG. 2A). Thus, when RNA is transcribed from the donor transposon, it is immediately cleaved due to the action of the hammerhead ribozyme. These transcripts, which lack a poly-A tail, are degraded by the cellular machinery, and those that are not degraded are not reverse transcribed, because they have no binding site for the poly-T primer, and thus are still not recovered. However, when a self-reporting transposon is inserted into the genome, the transposon is no longer adjacent to the hammerhead ribozyme, so transcription continues through the piggyBac terminal repeat and into the neighboring genomic region until a cryptic poly-A termination signal is encountered (see e.g., FIG. 2B). An additional feature of this invention is that because transcripts that are not poly-adenylated are efficiently degraded, the reporter gene effectively reports on whether a transposition event has occurred or not in the cell. For example, if the reporter is the tdTomato gene, red cells correspond to those in which a transposition event has occurred.

It is believed that this method has never been previously described. It was discovered here that it is extremely efficient at eliminating unwanted donor recovery.

Methods for Mapping TF Binding and mRNA Content from Thousands of Single Cells in Parallel

Recently, two methods, Drop-Seq and In-Drops, have been developed that can analyze the mRNA content from thousands of single cells in parallel. 10x Genomics has also developed a commercially available platform called Chromium that works in essentially the same way as these two methods. Methods are described herein for mapping transcription factor binding and mRNA content from thousands of single cells in parallel with a novel calling card protocol that uses the presently disclosed SRTs. Also described here is how this protocol is performed using the Drop-Seq platform, but it is well within the skill in the art to modify the protocol for In-Drops or the 10x Genomics platform (the platform currently preferred).

Drop-Seq is a method which can analyze the genome-wide mRNA levels of tens of thousands of cells per day at a cost of 5-7 cents per cell. Drop-Seq uses microparticle beads that are coated with oligonucleotides encoding a stretch of 30 poly-Ts at their 3′ ends. Each oligonucleotide also possesses a 12-base pair (bp) cell barcode, shared across all sequences on the same bead, and an 8-bp molecular identifier which is unique to each sequence (see e.g., FIG. 2A). Using a microfluidic device, beads in lysis buffer intersect with a flow of single cells in suspension. An oil stream splits this aqueous stream into droplets, where a proportion of these droplets contain one cell and one bead. Once combined, the cell lyses and polyadenylated transcripts are captured on the polythymidine portion of the bead oligonucleotides. Following recovery of the transcriptome-loaded beads (STAMPs: single-cell transcriptomes attached to microparticles) from this emulsion, library preparation is performed in bulk where single-cell resolution is retained as a result of incorporation of the STAMP barcode into the cDNA. cDNA amplification is followed by tagmentation, producing a library of 3′ transcript ends tagged with a barcode denoting cell-of-origin (see e.g., FIG. 2A). The Drop-Seq workflow can process a remarkably large number of cells each day; for example, in the seminal publication describing this technology, over 44,000 cells were analyzed. FIG. 2B demonstrates this technology's ability to fully resolve a mixture of human embryonic kidney cells and mouse embryonic fibroblasts at the single-cell level.
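The bead oligonucleotide layout described above (a 12-bp cell barcode followed by an 8-bp molecular identifier) can be illustrated with a short Python sketch. It assumes that the barcode read begins with the cell barcode and that the molecular identifier follows immediately; the names `BeadTag` and `parse_bead_read` and the example sequence are hypothetical.

```python
from typing import NamedTuple

CELL_BARCODE_LEN = 12  # bases shared by every oligonucleotide on one bead
UMI_LEN = 8            # bases unique to each oligonucleotide


class BeadTag(NamedTuple):
    cell_barcode: str
    umi: str


def parse_bead_read(read: str) -> BeadTag:
    """Split a barcode read into its cell barcode and molecular identifier.

    Assumes the cell barcode comes first and the identifier follows directly.
    """
    needed = CELL_BARCODE_LEN + UMI_LEN
    if len(read) < needed:
        raise ValueError(f"read is shorter than {needed} bases")
    return BeadTag(read[:CELL_BARCODE_LEN], read[CELL_BARCODE_LEN:needed])


tag = parse_bead_read("ACGTACGTACGTTTGGCCAATTTTTT")
print(tag.cell_barcode, tag.umi)  # ACGTACGTACGT TTGGCCAA
```

Reads that share a cell barcode are attributed to the same bead, and therefore to the same cell, while the molecular identifier distinguishes individual captured transcripts.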

The main technical challenge in this aim is developing a protocol by which self-reporting Calling Cards can be read out by a bead-based single cell methodology such as Drop-Seq. Self-reporting transposons contain a strong EF1α promoter that drives the transcription of mRNA molecules that extend through the terminal repeat of the transposon and into the genome (see e.g., FIG. 1), so that the location of the Calling Card is reported in mRNA. To map a transposon location and assign it to a cell, the STAMP, which is on the 3′ end of the cDNA molecule, must be associated with the junction between the transposon and the genome, which is on the 5′ end of the cDNA molecule. To do so, the following workflow was developed:

1. mRNA molecules from a single cell are captured on a STAMP-encoded bead in a Drop-Seq droplet as per the standard Drop-Seq protocol (see e.g., FIG. 3A).

2. All beads are pooled and reverse transcription and template switching is performed (see e.g., FIG. 3B). Each cDNA molecule has a SMART primer on the 5′ end and a SMARTer primer on the 3′ end (see e.g., FIG. 3B). These primers differ only by 3 base pairs at their 3′ terminal ends.

3. All transcripts are PCR amplified using primers that bind the SMART and SMARTer regions.

4. The post-amplified PCR reaction is split in two. Half of the reaction is carried through the standard Drop-Seq protocol to obtain mRNA levels for thousands of single cells.

5. Calling Card-generated cDNA is amplified using a biotinylated SMART primer and a biotinylated primer specific for the transposon terminal repeat (see e.g., FIG. 3D). From here on, the protocol used follows the standard Illumina mate pair workflow unless otherwise noted.

6. This product is circularized by ligation, sheared via sonication, and the biotinylated fragments are pulled down with streptavidin beads (see e.g., FIG. 3E-FIG. 3G).

7. Illumina Y-adapters are ligated onto the fragments, an amplification PCR is performed, and the product is loaded on an Illumina NextSeq sequencer (see e.g., FIG. 3H-FIG. 3I).

After sequencing, all mRNA transcripts and Calling Card derived transcripts are mapped to the genome and assigned to a STAMP. STAMPs are clustered by mRNA expression and Calling Card insertions from similar cell types are analyzed together to increase statistical power.
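The pooling of insertions from similar cell types can be sketched as follows. This is an illustrative Python example only: it assumes cluster labels have already been assigned from the mRNA data by some upstream clustering step, and the function name `pool_insertions_by_cluster`, the barcodes, and the cluster names are hypothetical.

```python
from collections import Counter, defaultdict


def pool_insertions_by_cluster(cell_to_cluster, cell_to_insertions):
    """Count, per cluster, how many cells carry each insertion site."""
    pooled = defaultdict(Counter)
    for cell, insertions in cell_to_insertions.items():
        cluster = cell_to_cluster.get(cell)
        if cluster is None:  # cell has no expression-based cluster label
            continue
        pooled[cluster].update(insertions)
    return {cluster: dict(counts) for cluster, counts in pooled.items()}


# Toy inputs (hypothetical barcodes, cluster labels, and coordinates).
cell_to_cluster = {"AAAC": "neuron", "AAAG": "neuron", "AAAT": "astrocyte"}
cell_to_insertions = {
    "AAAC": [("chr1", 1000)],
    "AAAG": [("chr1", 1000), ("chr2", 500)],
    "AAAT": [("chr3", 42)],
    "AAAA": [("chr4", 7)],  # unlabeled cell, skipped
}
pooled = pool_insertions_by_cluster(cell_to_cluster, cell_to_insertions)
print(pooled["neuron"])  # {('chr1', 1000): 2, ('chr2', 500): 1}
```

Because each cell contributes only a few insertions, pooling by cluster is what gives each cell type enough insertions for downstream peak calling.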

Methods for Recovering SRTs and mRNA Content from Thousands of Single Cells in Parallel

The circularization protocol described above is an extension of the Illumina paired end kit used to generate paired end reads for genomic sequencing, but the application of this methodology to the problem at hand is highly novel. No other method has been described for Drop-Seq/InDrops/10x that connects the cellular barcode at the 3′ end of the cDNA molecule to the 5′ end of the molecule.

Molecular Triangulation

Molecular triangulation is a computational method to filter out intermolecular ligations in the analysis of single cell calling cards. One problem with the method described above is that sometimes intermolecular ligations occur, rather than the desired intramolecular ligation. When this happens, an insertion can be assigned to the wrong cell. Described here is a method to filter out intermolecular ligations. To do so, first the circularization protocol described above was performed. From the same sample, the 5′ ends (without cellular barcode information) and the 3′ ends (with cellular barcode information but with imprecise information about the insertion site) were mapped. Then the 5′ and 3′ end pairs identified by the circularization protocol were filtered for valid pairs. A valid circularization pair is a 5′ and 3′ pair for which the transposon insertion site (in the 5′ part of the pair) is found in the 5′ end mapping data, the cellular barcode is found in the 3′ mapping data, and the location of the 3′ end maps near the 5′ end on the genome (e.g., less than 5 kb away). Another error correction scheme is to require that a given 5′-3′ pair in the circularization data is supported by at least N UMIs, where N is at least 2.
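The validity test described above can be sketched in Python. The data structures and the names `is_valid_pair` and `filter_pairs` are assumptions made for illustration, and applying the UMI-support requirement together with the triangulation filter (rather than as an alternative) is a choice made for this example only.

```python
def is_valid_pair(pair, five_prime_sites, three_prime_ends, max_dist=5000):
    """Check one circularization pair against the 5' and 3' mapping data.

    pair: (chrom, insertion_pos, cell_barcode) from the circularization data.
    five_prime_sites: set of (chrom, pos) insertion sites from 5' end mapping.
    three_prime_ends: dict of cell_barcode -> list of (chrom, pos) 3' ends.
    """
    chrom, pos, barcode = pair
    if (chrom, pos) not in five_prime_sites:
        return False
    return any(
        end_chrom == chrom and abs(end_pos - pos) < max_dist
        for end_chrom, end_pos in three_prime_ends.get(barcode, ())
    )


def filter_pairs(pair_umis, five_prime_sites, three_prime_ends, min_umis=2):
    """Keep pairs with enough UMI support that also pass triangulation."""
    return [
        pair
        for pair, umis in pair_umis.items()
        if len(umis) >= min_umis
        and is_valid_pair(pair, five_prime_sites, three_prime_ends)
    ]


# Toy inputs (hypothetical coordinates, barcodes, and UMIs).
five_prime_sites = {("chr1", 1000)}
three_prime_ends = {"CELLA": [("chr1", 3200)], "CELLB": [("chr2", 500)]}
pair_umis = {
    ("chr1", 1000, "CELLA"): {"U1", "U2"},  # valid: 3' end is 2.2 kb away
    ("chr1", 1000, "CELLB"): {"U3", "U4"},  # rejected: 3' end on chr2
    ("chr1", 2000, "CELLA"): {"U5", "U6"},  # rejected: site absent from 5' data
}
print(filter_pairs(pair_umis, five_prime_sites, three_prime_ends))
```

In this toy input only the first pair survives; the second would have assigned the insertion to a cell whose 3′ ends map elsewhere, which is the signature of an intermolecular ligation.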

It is believed that no similar methods have been described.

Recovering SRTs from Bulk Samples

Small modifications to the protocol can be used to recover transposons in bulk from small numbers of cells. Briefly, RNA is harvested from cells, and reverse transcribed using a poly-T primer tailed with a universal sequence (e.g., the SMART primer). The cDNA is then amplified by PCR using a primer specific for the transposon LTR and the universal sequence, tagmented using the Nextera kit, and amplified using the LTR primer and the Nextera primer and then sequenced. This is identical to the 5′ end recovery protocol used in the triangulation method. This method allows for the Calling Card protocol to be performed with fewer cells and also greatly reduces unwanted donor recovery.

SRT Methods for Other Transposons

This technology can be used to map other transposons including the L1 retrotransposon, the Sleeping Beauty transposon, and the Helitron. Constructs were generated for all three of these transposons. The calling card protocol was demonstrated with the Sleeping Beauty transposon system. This may be useful because the unfused Sleeping Beauty transposase inserts transposons into the genome in a more random fashion than does the piggyBac transposase.

Modification: Recovering Transposon Insertion Locations Using a T7 Promoter

For some applications, it may not be desirable to have an active promoter generate the mRNA transcripts that span the transposon-genome junction. For example, in some cell types, a given promoter may not be active. Also, there may be some locations in the genome that silence promoters. An alternative strategy is to include a bacteriophage T7 promoter in the transposon, perform the experiment, collect cellular DNA, and then generate the RNA that spans the transposon-genome junction via an in vitro transcription reaction. For example, one could harvest cellular DNA, shear, ligate on a Y-linker with a universal primer, perform an in vitro transcription reaction, perform first strand synthesis using the universal priming sequence, and PCR amplify using a universal primer and a primer targeting the transposon terminal repeat. If these primers are tailed with the Illumina P5 and P7 (and optionally Seq1 and Seq2) sequences, the products could be directly loaded on an Illumina sequencer and sequenced.

16s Ribosomal RNA

An alternative method to the ribozyme for removing unwanted donor recovery is to insert a 16s ribosomal RNA sequence in the donor construct immediately adjacent to the piggyBac terminal repeat. Bacterial RNA-Seq kits routinely and efficiently remove 16s ribosomal RNA from other bacterial transcripts, so these kits can be used to remove RNA transcripts that are generated by “unhopped” donor molecules.

WPRE Element

The Woodchuck Hepatitis Virus (WHP) Posttranscriptional Regulatory Element (WPRE) is a DNA sequence that when transcribed produces an RNA tertiary structure that causes the mRNA molecule to be extremely stable. By including a WPRE element in the SRT, the RNA molecules produced upon genomic insertion will be stabilized.

Single Cell Lineage Tracing

Natural transposition events are already used to infer phylogenetic relationships between species. The methods described above can be used to perform a similar kind of analysis but at cellular, instead of geologic, time scales. The single cell calling cards protocol described above can be used with only minor modifications to perform lineage tracing. Rather than using TF-directed piggyBac transposases, wild-type, undirected transposases that can integrate anywhere in the genome can be used. Each insertion event can be thought of as a lineage-specific barcode; the content and distribution of lineage barcodes can be used to infer somatic phylogenies. By performing single cell calling cards on a heterogeneous population of cells, novel cell types can be identified and their genealogical context can be reconstructed at the same time.
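One simple way to use insertions as lineage barcodes is to score how many insertion sites two cells share. The sketch below uses the Jaccard index as the similarity measure; that choice, the function name `jaccard`, and the toy insertion sets are assumptions for illustration, not a method stated in this disclosure.

```python
def jaccard(a: set, b: set) -> float:
    """Fraction of insertion sites shared between two cells."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy insertion sets per cell (hypothetical coordinates).
cells = {
    "c1": {("chr1", 100), ("chr2", 250), ("chr5", 900)},
    "c2": {("chr1", 100), ("chr2", 250)},
    "c3": {("chr7", 40)},
}
print(jaccard(cells["c1"], cells["c2"]))  # shares 2 of 3 sites
print(jaccard(cells["c1"], cells["c3"]))  # 0.0, no shared sites
```

Cells that share many insertion sites are more likely to descend from a common ancestor in which those insertions occurred; a pairwise similarity matrix of this kind could be passed to any standard hierarchical clustering or tree-building procedure.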

Reading Out Transposon Mutagenesis Screens

The methods described herein can be used for any application in which it is useful to map the locations of engineered transposons. For example, transposon mutagenesis is a widely used technique, and the methods described here can be applied to this problem.

The benchmark for any single cell genomic technology is to take two cell types from different species, mix them together, perform the assay, and assign mapped reads to single cells. If successful, each cell will show reads mapping to either one or the other species, but not both. As described herein, near-perfect species separation of single cell calling card data has been achieved (see e.g., FIG. 7C). Furthermore, the locations of these insertions in HEK293 and mouse N2a cells were mapped. There is excellent concordance with previously measured Brd4 binding and H3K27ac (see e.g., FIG. 4). Together, these results demonstrate that this technology works.
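The species-mixing benchmark reduces to a per-cell classification of read counts. A minimal Python sketch follows; the 90% purity threshold and the function name `classify_cell` are hypothetical choices for illustration and are not values taken from this disclosure.

```python
def classify_cell(human_reads: int, mouse_reads: int, purity: float = 0.9) -> str:
    """Label a cell barcode by the species its mapped reads come from.

    purity is an assumed threshold: the minimum fraction of reads that must
    map to one species for the barcode to be called as that species.
    """
    total = human_reads + mouse_reads
    if total == 0:
        return "empty"
    if human_reads / total >= purity:
        return "human"
    if mouse_reads / total >= purity:
        return "mouse"
    return "mixed"


print(classify_cell(98, 2))   # human
print(classify_cell(1, 120))  # mouse
print(classify_cell(40, 60))  # mixed
```

A low fraction of "mixed" barcodes indicates that insertions are being assigned to the correct cell of origin.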

Example 2: Self-Reporting Transposons Enable Simultaneous Readout of Gene Expression and Transcription Factor Binding in Single Cells

The following example describes self-reporting transposons (SRTs) and their use in single cell calling cards (scCC), a novel assay for simultaneously capturing gene expression profiles and mapping TF binding sites in single cells.

First, it is demonstrated how the genomic locations of SRTs can be recovered from mRNA. Next, it is shown how SRTs deposited by the piggyBac transposase can be used to map the genome-wide localization of the TFs SP1, through a direct fusion of the two proteins, and BRD4, through its native affinity for piggyBac. Then, the scCC method is presented, which maps SRTs from scRNA-seq libraries, thus enabling concomitant identification of cell types and TF binding sites in those same cells. As a proof-of-concept, the recovery of cell type specific BRD4 and SP1 binding sites from cultured cells was demonstrated. Finally, Brd4 binding sites in the mouse cortex were mapped at single cell resolution, thus establishing a new technique for studying TF biology in situ.

Introduction

Transcription factors (TFs) regulate gene expression during the most critical junctures in the specification of cell fate. They are central to the maintenance of stem cell pluripotency and required for normal organogenesis during development. Overexpression of certain TFs can transdifferentiate one cell type into another, while abolishing TF binding sites can result in striking global phenotypes. Furthermore, the pattern of TF binding is often dysregulated during disease states. A better understanding of TF binding during tissue development and homeostasis would provide important insights into how cellular diversity arises and is maintained under normal and abnormal biological conditions.

In the past few years, single cell RNA-seq (scRNA-seq) techniques have emerged as the de facto methods for characterizing cellular diversity in complex tissues and organisms. More recently, multi-modal scRNA-seq technologies have been developed that combine transcriptional information with other genomic assays. These technologies are motivated by the realization that while scRNA-seq can describe the current state of a biological system, it alone cannot explain how that state arose. Thus, for a given population of cells, one can now simultaneously measure transcriptome and genome, or methylome, or chromatin accessibility, or cell-surface markers. These methods enable greater insight into the regulatory elements driving individual transcriptional programs.

A notable lacuna in the single cell repertoire is a method for simultaneously assaying transcriptome and TF binding. Such a method would allow for the genome-wide identification of TF binding sites across multiple cell types in complex tissues. ChIP-seq is the most popular technique for studying TF binding, and while single cell ChIP-seq has been previously described, this technique has only been employed to map highly abundant proteins such as methylated histones. DamID can recover TF binding sites by identifying nearby exogenously methylated adenines, but in single cells it has only been used to study lamina-associated domains. Importantly, both methods yield sparse data and neither technique simultaneously captures mRNA. Thus, each can only be used in a cell type specific manner if the cell type is known a priori and if sufficient numbers of cells are obtained by selection or sorting to overcome sparsity. In contrast, a single cell assay for transposase-accessible chromatin (scATAC-seq) can be used to identify nucleosome-free regions that may be bound by TFs across large numbers of mixed cells. However, it can only suggest potential DNA binding proteins by motif inference. It is therefore not a direct measurement of TF occupancy, and moreover it cannot be used to study transcriptional regulators that bind DNA indirectly or non-specifically, such as chromatin remodelers.

Transposon calling cards have been previously developed by the inventors as an alternative method to study TF binding. This system relies on two components: a fusion between a GAP, such as a TF, and a transposase, and a transposon carrying a reporter gene. The fusion transposase deposits transposons near TF binding sites; these insertions are subsequently amplified from genomic DNA and subjected to high-throughput sequencing. Thus, the redirected transposase leaves “calling cards” at the genomic locations it has visited, which can then be identified later in time. The result is a genome-wide assay of all binding sites for that particular TF. In mammalian cells, piggyBac transposase fused to the TF SP1 has been heterologously expressed and the resulting pattern of insertions reflects SP1's DNA binding preferences. However, the method as described above was only feasible in bulk preparations.

Presented here is single cell calling cards (scCC), an extension of transposon calling cards that simultaneously profiles mRNA abundance and TF binding at single cell resolution. The key component of this work is a novel construct called the self-reporting transposon (SRT). Using SRTs, the genomic locations of inserted transposons can be mapped from either mRNA or DNA, but the use of mRNA enables both higher efficiency and compatibility with single-cell transcriptomics.

Here, it was established that TF-directed SRTs, in bulk, retain the ability to accurately identify TF binding sites. Next, it was demonstrated that the unfused piggyBac transposase, through its native affinity for the bromodomain TF BRD4, can be used to identify BRD4-bound super-enhancers (SEs). Also presented is the scCC method, which allows cell-specific mapping of SRTs from scRNA-seq libraries. This enables, in one experiment, concomitant assignment of cell types and identification of TF binding sites within those cells. In an experiment, scCC was used to map BRD4 and SP1 sites in mixtures of cultured human cells. This work concludes by identifying cell type-specific Brd4 binding sites in vivo in the postnatal mouse cortex. These results demonstrate that scCC can be a broadly applicable tool for the study of specific TF binding interactions across all cell types within a complex, multi-cellular tissue.

Results

Self-Reporting Transposons can be Mapped from mRNA Instead of Genomic DNA

In order to combine scRNA-seq with calling cards, a transposon was developed whose genomic position could be determined from mRNA. A piggyBac self-reporting transposon (SRT) was created by removing the polyadenylation signal from the standard DNA-based calling card vector (see e.g., FIG. 5A). This enables RNA polymerase II (Pol II) to transcribe the reporter gene contained in the transposon and continue through the terminal repeat (TR) into the flanking genomic sequence. Thus, SRTs “self-report” their locations through the unique genomic sequence found within the 3′ untranslated regions (UTRs) of these reporter gene transcripts. Although previously published gene- or enhancer-trap transposons could, in principle, also capture local positional information via RNA, they are resolution-limited to the nearest gene or enhancer, respectively. In contrast, the 3′ UTRs of SRT-derived transcripts contain the transposon-genome junction in the mRNA sequence, so insertions can be mapped with base pair precision.

SRTs are mapped following reverse transcription (RT) and PCR amplification of self-reporting transcripts. These transcripts contain stretches of adenines that are derived from either cryptic polyadenylation signals (PAS) or polyadenine tracts encoded in genomic DNA downstream of the SRT insertion point (see e.g., FIG. 1B). A poly(T) RT primer hybridizes with these transcripts and introduces a universal priming site at one end of the transcripts. A pair of nested PCRs with an intermediate tagmentation step enables recovery of the transposon-genome junction. After adapter trimming and alignment, the 5′ coordinates of these reads identify the genomic locations of insertions in the library. Libraries generated without transposase produce very few genomically mapped reads, but the protocol is highly efficient when transposase is added (see e.g., FIG. 9A).
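Reading off the 5′ coordinate of each aligned read and collapsing reads into insertions can be sketched as follows. This Python example assumes alignments are given as 0-based, half-open intervals with a strand, as in BED-style records; the function names and example records are hypothetical.

```python
from collections import Counter


def five_prime_coordinate(start: int, end: int, strand: str) -> int:
    """5' base of an alignment given as a 0-based, half-open interval."""
    if strand == "+":
        return start
    if strand == "-":
        return end - 1
    raise ValueError(f"unknown strand: {strand!r}")


def count_insertions(alignments):
    """Collapse aligned reads into insertions with supporting read counts.

    alignments: iterable of (chrom, start, end, strand) tuples.
    """
    counts = Counter()
    for chrom, start, end, strand in alignments:
        site = (chrom, five_prime_coordinate(start, end, strand), strand)
        counts[site] += 1
    return counts


reads = [
    ("chr1", 100, 150, "+"),
    ("chr1", 100, 140, "+"),  # same 5' end, so same insertion
    ("chr1", 60, 101, "-"),   # minus strand: 5' end is the last base
]
print(count_insertions(reads))
```

Tagmentation gives reads from one insertion variable 3′ ends, which is why only the 5′ coordinate is used to define an insertion here.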

To compare transposon recovery between the new RNA-based protocol and the standard DNA-based inverse PCR protocol, HCT-116 cells were transfected with a plasmid carrying a piggyBac SRT (PB-SRT-Puro) and a plasmid encoding a fusion of the TF SP1 and piggyBac transposase (SP1-PBase; see e.g., FIG. 5A). After two weeks of selection, approximately 2,300 puromycin-resistant clones were obtained. These cells were split in half: one half underwent inverse PCR while the other half was processed with the new RNA workflow. With inverse PCR, 31,001 insertions were obtained (mean coverage: 709 reads per insertion), while the RNA-based protocol recovered 62,500 insertions (mean coverage: 240 reads per insertion). About 80% of insertions recovered by DNA calling cards were also recovered in the RNA-based library (25,060 insertions; see e.g., FIG. 5C), an overlap comparable to that between technical replicates of RNA recovery (see e.g., FIG. 9B). However, the RNA protocol recovered a further 37,440 insertions that were not found in the DNA-based library. To determine if these extra insertions were genuine, the distribution of insertions was analyzed by genetic annotation (see e.g., FIG. 5D) or chromatin state (see e.g., FIG. 9C and FIG. 20). Transposons mapped from either the DNA or the RNA libraries showed comparable distribution into annotated domains of particular functional or chromatin states, indicating that RNA recovery of transposons appears to be unbiased relative to the established DNA-based protocol.
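The overlap figures quoted above are internally consistent, as the following short Python check shows. The variable names are illustrative; the counts are those stated in the text.

```python
dna_insertions = 31_001  # recovered by inverse PCR
rna_insertions = 62_500  # recovered by the RNA-based protocol
shared = 25_060          # found in both libraries

print(f"{shared / dna_insertions:.1%}")  # 80.8% of DNA insertions also in RNA
print(rna_insertions - shared)           # 37440 insertions unique to RNA
```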

Because piggyBac is known to preferentially insert near active chromatin, SRT recovery may be biased towards euchromatic regions. Previous reports have shown that the Sleeping Beauty transposase has very little preference for chromatin state. A self-reporting Sleeping Beauty transposon was created and its genome-wide distribution compared to that of SRTs deposited by wild-type piggyBac (see e.g., FIG. 10A and FIG. 10B). Undirected piggyBac transposases appeared to modestly prefer transposing into promoters and enhancers, which is consistent with previous reports (see e.g., FIG. 20). By contrast, Sleeping Beauty showed largely uniform rates of insertions across all chromatin states, including repressed and inactive chromatin (see e.g., FIG. 10B). These results affirm that while RNA-based recovery is more efficient, it still preserves the underlying genomic distributions of insertions. Furthermore, because SRTs can be recovered from virtually any chromatin state, RNA-based calling card recovery can be employed to analyze a variety of TFs with broad chromatin-binding preferences.

A common artifact observed in DNA-based transposon recovery is a large fraction of reads mapping back to the donor transposon plasmid instead of the genome. Although this can be mitigated by long selection times or by digestion with the methyladenine-sensitive enzyme DpnI, these methods do not completely eliminate background and are not compatible with all experimental paradigms, in particular viral transduction. To reduce this artifact, a hammerhead ribozyme was included in the SRT plasmid downstream of the 5′ TR. Before transposition, the ribozyme will cleave the nascent transcript originating from the marker gene, thus preventing RT. Transposition allows the SRT to escape the downstream ribozyme, leading to recovery of the self-reporting transcript. In a comparison of DNA- and RNA-based recovery, about 15% of reads from the SP1-PBase DNA library aligned to the plasmid, compared to fewer than 1% of reads from the RNA library (see e.g., FIG. 9D). Thus, the inclusion of a self-cleaving ribozyme virtually eliminates recovery of un-excised transposons.

SP1 Fused to piggyBac Directs SRT Insertions to SP1 Binding Sites

Next, it was confirmed that RNA calling cards, in bulk, can still be used to identify TF binding sites. 10-12 replicates of HCT-116 cells were transfected with plasmids containing the PB-SRT-Puro donor transposon and SP1 fused to either piggyBac (SP1-PBase) or a hyperactive variant of piggyBac (SP1-HyPBase). As controls, a similar number of replicates was also transfected with undirected PBase or HyPBase, respectively. 411,287 insertions from SP1-PBase and 1,523,169 insertions from PBase were obtained. Similarly, 2,033,229 SP1-HyPBase insertions and 5,779,101 insertions from HyPBase were obtained.

FIG. 5E and FIG. 12A show the redirection of SRT calling cards by SP1-PBase and SP1-HyPBase, respectively, to three representative SP1-bound regions of the genome. Each circle in the insertions track represents an individual transposition event whose genomic position is on the x-axis. The y-axis is the number of reads supporting each insertion on a log₁₀ scale. To better compare transposition rates across libraries with different numbers of insertions, the normalized local insertion rate was calculated and plotted as a density track. All three of the loci depicted in FIG. 5E and FIG. 12A show a specific enrichment of calling card insertions in the SP1 fusion experiments that is not observed in the undirected control libraries. Next, peaks were called at all genomic regions enriched for SP1-directed transposition. The number of insertions observed at significant peaks for both SP1-PBase and SP1-HyPBase was highly reproducible between biological replicates (R²=0.84 and 0.96, respectively; see e.g., FIG. 11A and FIG. 12B). Furthermore, calling card peaks were highly enriched for SP1 ChIP-seq signal at their centers, both on average (see e.g., FIG. 11B and FIG. 12C) and in aggregate (see e.g., FIG. 11C and FIG. 12D). SP1 is known to preferentially bind near TSSs and is also thought to play a role in demethylating CpG islands. Therefore, it was confirmed that the SP1-directed transposases preferentially inserted SRT calling cards near TSSs, CpG islands, and unmethylated CpGs at statistically significant frequencies (p<10⁻⁹ in each instance, G test of independence; see e.g., FIG. 12D and FIG. 12E). Moreover, compared to undirected piggyBac, SP1-directed piggyBac showed a striking preference for depositing insertions into promoters (see e.g., FIG. 10A and FIG. 10B). Lastly, regions targeted by SP1-PBase and SP1-HyPBase were enriched for the canonical SP1 DNA binding motif (p<10⁻⁷⁰ in each instance; see e.g., FIG. 11E and FIG. 12F). Taken together, these results indicate that SP1 can redirect piggyBac SRTs near SP1 binding sites.
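A G test of independence, as cited above, compares observed counts in a contingency table with the counts expected if rows and columns were independent. The sketch below implements the 2×2 case using only the Python standard library; the function name `g_test_2x2` and the example counts are hypothetical and are not data from this disclosure. For one degree of freedom, the chi-square tail probability equals erfc(√(G/2)).

```python
import math


def g_test_2x2(a: int, b: int, c: int, d: int):
    """G statistic and p-value (1 degree of freedom) for [[a, b], [c, d]].

    For example, rows could be directed versus undirected insertions and
    columns could be insertions near versus far from a TSS.
    """
    total = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    g = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += 2 * observed * math.log(observed / expected)
    # Chi-square survival function for 1 d.f.: erfc(sqrt(g / 2)).
    p = math.erfc(math.sqrt(max(g, 0.0) / 2))
    return g, p


g, p = g_test_2x2(50, 50, 10, 90)  # hypothetical counts
print(f"G = {g:.1f}, p = {p:.2e}")
```

With very large insertion counts, even modest differences in proportion yield very small p-values, so effect sizes are worth reporting alongside the test.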

Clustering of Undirected piggyBac Insertions Identifies BRD4-Bound Super-Enhancers

Previous studies have shown that the undirected piggyBac transposase preferentially inserts transposons near super-enhancers (SEs), a class of regulatory element that is thought to play a critical role in regulating cell identity.

SEs are often enriched for the histone modification H3K27ac as well as RNA polymerase II and general transcription factors like the Mediator subunit MED1 and the bromodomain protein BRD4. Moreover, the piggyBac transposase has a strong biophysical affinity for BRD4, as these proteins can be co-immunoprecipitated. Given the millions of insertions assayed from the undirected PBase and HyPBase controls in the SP1-directed experiments (see e.g., FIG. 5E and FIG. 12A), it may be possible to identify BRD4-bound SEs simply from the localization of undirected piggyBac transpositions.

Both undirected PBase and HyPBase showed non-uniform densities of insertions at loci bound by BRD4 (see e.g., FIG. 6A and FIG. 15). At statistically significant peaks of piggyBac calling cards, PBase and HyPBase showed high reproducibility of normalized insertions between biological replicates (see e.g., FIG. 6B and FIG. 13B). Next, the mean BRD4 enrichment was calculated, as assayed by ChIP-seq, across these peaks. piggyBac peaks showed significantly increased BRD4 signal compared to a genome-wide permutation of the peaks (p<10⁻⁹ in both instances, Kolmogorov-Smirnov test; see e.g., FIG. 6C and FIG. 13C). Maximum BRD4 ChIP-seq signal was observed at calling card peak centers and decreased symmetrically in both directions. It was also found that piggyBac peaks show striking ChIP-seq patterns for several histone modifications, in particular an enrichment for H3K27ac ChIP-seq signal (see e.g., FIG. 6D and FIG. 13D). Since bromodomains bind acetylated histones, this observation further supports the hypothesis that undirected piggyBac insertions can be used to map BRD4 binding. These peaks were also enriched in H3K4me1, another canonical enhancer mark, and depleted for H3K9me3 and H3K27me3, modifications associated with repressed chromatin. Taken together, these results demonstrate that piggyBac insertion density is highly correlated with BRD4 binding throughout the genome and that regions enriched for undirected piggyBac insertions share features common to enhancers.

To assess whether piggyBac peaks can be used to identify BRD4-bound SEs, a reference list of BRD4-bound super-enhancers in HCT-116 cells was created (see e.g., FIG. 6A, FIG. 13A) from BRD4 ChIP-seq data. Receiver-operator characteristic curves were then constructed. These are shown for PBase- and HyPBase-derived BRD4-bound super-enhancers in FIG. 6E and FIG. 13E. The high areas under the curves (0.98 in each instance) indicate that BRD4 super-enhancers from piggyBac transpositions can be robustly called. Calling card peaks are highly specific across a range of sensitivities. In addition, calling card peaks have high positive predictive value (AUPRC=0.92 in each instance) across a broad range of sensitivities (see e.g., FIG. 6F, FIG. 13F). Thus, undirected piggyBac transpositions are an accurate assay of BRD4-bound SEs.
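By way of non-limiting illustration, the two summary statistics named above can be computed from a list of peak scores and reference labels as sketched below. The function names, the use of the rank-sum identity for the ROC area, and the use of average precision as the estimator of AUPRC are assumptions made for this sketch; the disclosure does not state which estimator was used. The sketch also assumes that a higher score means a more confident call.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative (ties count as one half).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at each true positive.

    This is one common estimator of the area under the precision-recall
    curve; tied scores are ranked in input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / n_pos


# Made-up toy data: 1 marks a peak overlapping a reference super-enhancer.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels), average_precision(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of peaks and is written for clarity rather than speed.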

To better understand the relationship between SE sensitivity and the number of insertions recovered, the data from the PBase and HyPBase experiments were downsampled in half-log increments (see e.g., FIG. 14A and FIG. 14B). The resulting heatmaps indicate that sensitivity increases with the total number of insertions recovered. Since the number of insertions future experiments will yield is difficult to predict, linear interpolation was also performed on the downsampled data. The resulting contour plots (see e.g., FIG. 14C and FIG. 14D) indicate the approximate sensitivity of BRD4-bound SE detection in HCT-116 cells. These results suggest that even with as few as 10,000 insertions, sensitivities around 50% can still be obtained.
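A minimal sketch of downsampling in half-log increments followed by linear interpolation is given below. The starting size of 1,000 insertions, the random seed, the choice to interpolate on log₁₀ of the insertion count, and all sensitivity values in the example are hypothetical and are not taken from the experiments described herein.

```python
import random


def half_log_sizes(total, minimum=1000):
    """Sample sizes spaced by half a log10 unit, ending with `total` itself."""
    sizes = []
    step = 0
    while True:
        size = round(minimum * 10 ** (step / 2))
        if size >= total:
            break
        sizes.append(size)
        step += 1
    sizes.append(total)
    return sizes


def downsample(insertions, size, seed=0):
    """Draw `size` insertions without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(insertions, size)


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation; `xs` must be strictly increasing.

    Values outside the sampled range are clamped to the nearest endpoint.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            frac = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + frac * (ys[i + 1] - ys[i])


# Hypothetical sensitivities measured at log10(insertions) = 3.0, 3.5, 4.0.
log_counts = [3.0, 3.5, 4.0]
made_up_sensitivity = [0.20, 0.35, 0.50]
print(half_log_sizes(100000))
print(interpolate(3.25, log_counts, made_up_sensitivity))
```

In use, each size returned by `half_log_sizes` would be passed to `downsample`, peaks would be called on the subset, and the measured sensitivities would then be supplied to `interpolate`.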

Single Cell Calling Cards Enables Simultaneous Identification of Cell Type and Cell Type-Specific TF Binding Sites

Next, SRTs were recovered from scRNA-seq libraries. This can enable identification of cell types from transcriptomic clustering and, using the same source material, profiling of TF binding in those cell types. The 10× Chromium platform was adopted given its high efficiency of cell and transcript capture as well as its ease of use. As in many microfluidic scRNA-seq approaches, the cell barcode and unique molecular index (UMI) are attached to the 3′ ends of transcripts. This poses a molecular challenge for SRTs since the junction between the transposon and the genome may be many kilobases away, precluding the use of high-throughput short read sequencing. To overcome this barrier, a circularization strategy was developed to physically bring the cell barcode in apposition to the insertion site (see e.g., FIG. 7A).

A modified version of the bulk SRT amplification protocol was used, in which amplification was performed with primers that bound to the universal priming sequence next to the cell barcode and to the terminal sequence of the piggyBac TR. These primers were biotinylated and carried a 5′ phosphate group. The PCR products of this amplification were diluted and allowed to self-ligate overnight. They were then sheared and captured with streptavidin-coated magnetic beads. The rest of the library was prepared on-bead and involved end repair, A-tailing, and adapter ligation. A final PCR step added the required Illumina sequences for high-throughput sequencing. The standard Illumina read 1 primer read the cell barcode and UMI, while a custom read 2 primer, annealing to the end of the piggyBac 5′ TR, read into the genome. Thus, both the location of a piggyBac insertion and its cell of origin were collected. This method is referred to as single cell calling cards (scCC).

The method was validated by performing a species-mixing experiment using human HCT-116 cells and mouse N2a cells. Cells were mixed prior to droplet generation and the resulting emulsion was processed through first strand synthesis. At this point, half of the RT product was amplified according to the standard 10× protocol. The resulting scRNA-seq revealed strong species separation with an estimated multiplet rate of 3.2% (see e.g., FIG. 16A). The remainder of the first strand synthesis was used for the scCC protocol. The calling card analysis was restricted to those insertions whose cell barcodes were observed in the scRNA-seq library. The distribution of insertions across these cells reflected a continuum from pure mouse to pure human (see e.g., FIG. 16B-FIG. 16C). Since intramolecular ligation and subsequent PCR may introduce unwanted artifacts, such as mis-assignment of a barcode from cell type A to an insertion site in cell type B, each insertion in a given cell was required to have at least two different UMIs associated with it. Imposing this filter increased the number of pure mouse and human cells (see e.g., FIG. 16D), yielding clear species separation with an estimated multiplet rate of 7.8% (see e.g., FIG. 7B). This establishes that the method can map calling card insertions in single cells.
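The UMI filter described above can be sketched as follows. The record layout (cell barcode, insertion site, UMI) and the function name are assumptions made for illustration only; any representation that pairs each read with those three values would serve.

```python
from collections import defaultdict


def filter_by_umi(records, min_umis=2):
    """Keep (cell, insertion site) pairs supported by enough distinct UMIs.

    `records` is an iterable of (cell_barcode, insertion_site, umi) tuples,
    one per read. Repeated reads of the same UMI count only once.
    """
    umis_seen = defaultdict(set)
    for cell, site, umi in records:
        umis_seen[(cell, site)].add(umi)
    return {key for key, umis in umis_seen.items() if len(umis) >= min_umis}


# Made-up reads: only cellA at chr1:100 has two different UMIs.
reads = [
    ("cellA", "chr1:100", "UMI1"),
    ("cellA", "chr1:100", "UMI2"),
    ("cellA", "chr2:500", "UMI1"),
    ("cellA", "chr2:500", "UMI1"),
    ("cellB", "chr1:100", "UMI3"),
]
print(filter_by_umi(reads))
```

Counting distinct UMIs rather than reads is what distinguishes independently captured transcripts from PCR duplicates of a single chimeric molecule.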

It was then tested whether scCC could discern cell type-specific TF binding. Two human cell lines, HCT-116 and K562, were transfected with HyPBase and PB-SRT-Puro and mixed together. The resulting scRNA-seq libraries clearly identified the two major cell populations (see e.g., FIG. 7C and FIG. 17A). scCC libraries were then prepared from these cells and the cell barcodes from the HCT-116 and K562 clusters were used to assign insertions to the two different cell types. 44,214 insertions were obtained from 12,891 HCT-116 cells (mean 3.4 insertions per cell; mean 136 reads per insertion) and 132,994 insertions from 11,912 K562 cells (mean 11 insertions per cell; mean 10³ reads per insertion). The distribution of insertions per cell varied by cell type (see e.g., FIG. 17D) and did not appear to be correlated with differences in total RNA content (see e.g., FIG. 17B-FIG. 17C). Over 93% and 97% of HCT-116 and K562 cells, respectively, had at least one insertion event. Using scCC insertion data alone, peaks were called, and BRD4-bound loci that were specific to HCT-116 cells, shared between HCT-116 and K562 cells, or specific to K562 cells were successfully identified (see e.g., FIG. 7D). Both HCT-116 and K562 peaks showed statistically significant enrichment for BRD4 ChIP-seq signal (p<10⁻⁹ in both instances, Kolmogorov-Smirnov test; see e.g., FIG. 17E-FIG. 17F). From the earlier downsampling analysis, it was estimated that with a p-value cutoff of 10⁻⁹, the sensitivity for detecting BRD4-bound super-enhancers would be approximately 60% (see e.g., FIG. 14D). The actual sensitivity at this level of recovery was 64%, indicating that downsampling analysis can reasonably estimate the performance of scCC. In all, these experiments demonstrate that scCC can be used to deconvolve cell type-specific TF binding.

Since these BRD4 binding sites were identified using undirected HyPBase, it was also tested whether TF-piggyBac fusions would still work with scCC. HCT-116 cells were transfected with SP1-HyPBase and scRNA-seq was performed. scCC libraries were made from these experiments and 92,406 insertions were identified from 30,682 cells (mean 3 insertions per cell; mean 129 reads per insertion). Over 84% of cells had at least one insertion. The slight reduction in insertions per cell with the SP1 fusion is consistent with previous studies, although the distribution of insertions recovered per cell was similar to that of the undirected transposase (see e.g., FIG. 17G). As was observed in bulk (see e.g., FIG. 12A), SP1-HyPBase-directed insertions recovered from single cells localize to SP1 binding sites (see e.g., FIG. 7E). Finally, the reproducibility of the scCC method was tested. Both single cell HyPBase and SP1-HyPBase showed high concordance between biological replicates at statistically significant peaks (see e.g., FIG. 17H and FIG. 17I). Collectively, these experiments establish that scCC can be used to identify cell type-specific binding sites of both bromodomain and DNA-binding TFs.

Discussion

Mapping TF binding in heterogeneous tissues is a challenging problem because traditional methods combine signals from multiple cell types into a single, agglomerated profile. The difficulty is further compounded if individual cell types are difficult to identify, isolate, or are rare, precluding their study. Single cell RNA-seq is a promising paradigm for handling such heterogeneity. Until now, however, it has been impossible to directly study the actions of individual TFs and connect them to specific cell states. Presented here is a new method, single cell calling cards (scCC), that enables simultaneous identification of cell types and TF binding sites from complex mixtures and tissues. This is an important addition to the single cell repertoire and fills a recognized void in the field. It is anticipated that this technique will enable researchers to study the consequences of TF binding in a variety of ex vivo and in situ models.

A concern with any transposon-based technique is the potential for deleterious interruption of target genes leading to cell death and thereby false negatives. Previous experiments in diploid yeast found that calling cards are deposited into promoters of essential and non-essential genes at comparable frequencies. Since mammalian genomes have much larger intergenic regions than yeast, human and mouse genomes are likely also able to tolerate calling card transpositions. Indeed, the fact that SRTs can be deposited in the developing mouse brain into enhancers and super-enhancers suggests a small mutagenic burden.

One of the limitations of this technique is the relatively few insertions recovered on a per-cell basis, inflating the number of cells that must be analyzed to achieve good sensitivity. Previous studies have reported up to 15-30 insertions per cell for PBase, and likely higher for HyPBase. In the experiments described here, fewer insertions were recovered per cell on average. This is likely due to the low capture rate of mRNA transcripts, which is common to all scRNA-seq methods. The inclusion of cis-regulatory features known to enhance mRNA maturation and stability, such as the woodchuck hepatitis virus posttranscriptional regulatory element (WPRE), may increase representation of SRTs in scRNA-seq libraries. Furthermore, as the transcript capture rates of scRNA-seq technologies improve, it is expected that the sensitivity of the method will increase. The sensitivity of scCC can also be improved by simply analyzing larger numbers of cells, such as with cell hashing or combinatorial barcoding. Since the per-cell costs for scRNA-seq are falling exponentially, it is expected that scCC can be used to analyze TF binding in even very rare cell types in the near future. Alternatively, SRTs can be combined with Cre.

The scCC experiments described here employed the piggyBac transposase, but for some applications, the use of other transposases may prove advantageous. piggyBac inserts almost exclusively into TTAA tetranucleotides. For TFs that bind GC-rich regions or have high GC-content motifs, piggyBac fusions may have a difficult time finding nearby insertion sites. Sleeping Beauty, which inserts into TA dinucleotides, or Tol2, which does not have a strict insertion site preference, could be used to overcome these limitations.

However, the natural affinity of the piggyBac transposase for BRD4 makes it the ideal choice for the study of BRD4-bound SEs, which play important regulatory roles in development and disease. It is unclear why piggyBac shows such an affinity. Recent evidence suggests that SEs form intranuclear liquid phase condensates and that SE-associated proteins like MED1 and BRD4 have intrinsically disordered regions that may allow them to form these condensates. It may be that piggyBac has a similarly disordered domain that allows it to preferentially enter these condensates, thereby enriching SEs with insertions.

The defining feature of the scCC method is the self-reporting transposon (SRT). While the piggyBac and Sleeping Beauty SRTs are described herein, the self-reporting paradigm should be generalizable to any transposon lacking a polyadenylation signal (PAS) in at least one terminal repeat. Expanding the palette of SRTs will illuminate the genome-wide behaviors of transposases and may yield further insight into chromatin dynamics. Simultaneous expression of many TFs, each tagged to a different transposase, may also enable multiplexed studies of TF binding in the same cells. Mapping SRTs using cellular RNA appears to be substantially more efficient than the DNA-based inverse PCR method, but the reasons for this are somewhat unclear. Some efficiency is likely gained by eliminating self-ligation, as well as having multiple mRNA copies of each insertion to buffer against PCR artifacts. It is also unknown what fraction of self-reporting transcripts are actually polyadenylated as opposed to merely containing A-rich genomic tracts. Non-genic PASs prevent anti-sense transcription, which suggests that PASs may be more common in the genome than previously appreciated. Targeted 3′-end sequencing of SRT libraries should help resolve this question, while long-read sequencing of self-reporting transcripts may identify non-canonical PASs. Finally, SRTs could lead to new single cell transposon-based assays. For example, just as CRISPR/Cas9 has been combined with scRNA-seq to read out the transcriptional effects of gene deletion, SRTs will allow transposon mutagenesis screens to be read out by scRNA-seq in a highly parallel fashion.

Finally, since calling card insertions are genomically integrated and preserved through mitosis, they could serve as a molecular record of cellular events. The use of an inducible transposase would enable the recording and identification of temporally-restricted TF binding sites. This would help uncover the stepwise order of events underlying the regulation of specific genes and inform cell fate decision making. More generally, transposon insertions could serve as barcodes of developmental lineage. Single transposition events have been used to delineate relationships during hematopoiesis. Multiplexing several SRTs across every cell in an organism could code lineage in a cumulative and combinatorially diverse fashion, generating high-resolution cellular phylogenies.

Methods

Cell Culture

HCT-116, N2a, and HEK293T cells were cultured in Dulbecco's Modified Eagle Medium (DMEM; Gibco #11965-084) supplemented with 10% fetal bovine serum (FBS; Peak Serum #PS-FB3) and 1% antibiotic-antimycotic (Anti-Anti; Gibco #15240-062). K562 cells were grown under the same conditions as the HCT-116 and N2a cells, except that DMEM was replaced with RPMI 1640 Medium (Gibco #11875-085). Cells were grown at 37° C. with 5% carbon dioxide (CO₂). Puromycin (Sigma #P8899) was added 24 hours after transfection at a final concentration of 2 μg/ml. Media was replenished every 2 days.

DNA- vs. RNA-Based Recovery

Approximately 500,000 HCT-116 cells were plated in a single well of a 6-well plate. Cells were transfected with 2.5 μg of the SP1-PBase plasmid (for a full list of plasmids, see TABLE 1) and 2.5 μg of the PB-SRT-Puro plasmid using Lipofectamine 3000 (Thermo Fisher #L3000015) following the manufacturer's instructions.

TABLE 1 Plasmids referenced herein.

Plasmid | Description | Internal Accession No. | Addgene? | Addgene Accession No.
PBase | piggyBac transposase | pRM1024 | NA | NA
HyPBase | Hyperactive piggyBac transposase | pRM1114 | NA | NA
SP1-PBase | SP1 fused to piggyBac | pRM1023 | NA | NA
SP1-HyPBase | SP1 fused to hyperactive piggyBac | pRM1677 | NA | NA
PB-SRT-Puro | piggyBac SRT with puromycin reporter | pRM1304 | NA | NA
PB-SRT-tdTomato | piggyBac SRT with tdTomato reporter | pRM1535 | NA | NA
SB100X | Hyperactive Sleeping Beauty transposase | pRM1137 | NA | NA
SB-SRT-Puro | Sleeping Beauty SRT with puromycin reporter | pRM1668 | NA | NA

After 24 hours, cells were split and plated 1:10 in each of three 10 cm dishes. Puromycin was then added and colonies were allowed to grow out under selection for two weeks. Approximately 2,300 colonies were obtained. All cells were pooled together and split into two populations: one was subjected to DNA extraction, self-ligation, and inverse PCR, as described previously, while the other underwent RNA extraction and SRT library preparation.

In Vitro Bulk Calling Card Experiments

Ten to twelve replicates of HCT-116 cells were cotransfected with 5 μg of PB-SRT-Puro plasmid and 5 μg PBase plasmid via Neon electroporation (Thermo Fisher #MPK10025). Each replicate contained 2×10⁶ cells. As a negative control, one replicate of HCT-116 cells was transfected with 5 μg PB-SRT-Puro plasmid only. The following settings were used: pulse voltage, 1,530 V; pulse width, 20 ms; pulse number, 1. Each replicate was allowed to recover in a single well of a 6-well plate for 24 hours before being split 1:1 into a 10 cm dish and adding puromycin. Cells were grown under selection for one week, by which time almost all negative control transfectants were dead. The same experimental setup was used for experiments with PB-SRT-Puro and each of SP1-PBase, HyPBase, and SP1-HyPBase plasmids, as well as with SB-SRT-Puro and SB100X plasmids. Each replicate was cultured independently under the aforementioned media conditions. After 7 days, each replicate was dissociated with trypsin-EDTA (Sigma #T4049) and single cell suspensions were created in phosphate-buffered saline (PBS; Gibco #14190-136). Aliquots of each replicate were cryopreserved in cell culture media supplemented with 5% DMSO. The remaining cells were pelleted by centrifugation at 300×g for 5 minutes. Cell pellets were either processed immediately or kept at −80° C. in RNAProtect Cell Reagent (QIAGEN #76526).

Isolation of Bulk RNA and Reverse Transcription

Total RNA was isolated from each replicate using the RNeasy Plus Mini Kit (QIAGEN #74134) following the manufacturer's instructions. Briefly, cell pellets were resuspended in 600 μl of Buffer RLT Plus with 1% β-mercaptoethanol (Gibco #21985-023). Cells were homogenized by vortexing. RNA was bound on gDNA Eliminator spin columns and treated with DNase (QIAGEN #79254) while on the column. After washing, RNA was eluted in 40 μl RNase-free H₂O. RNA was quantitated on a NanoDrop ND-1000 spectrophotometer (Thermo Fisher).

First strand synthesis on each replicate was performed with Maxima H Minus Reverse Transcriptase (Thermo Fisher #EP0752). 2 μg of total RNA was mixed with 1 μl 10 mM dNTPs (Clontech #639125) and 1 μl of 50 μM SMART_dT18VN primer (for a complete list of primer sequences, see TABLE 2) in a total volume of 14 μl, and the mixture was incubated at 65° C. for 5 minutes.

TABLE 2 Oligonucleotides referenced in this work.

Primer | Primer Sequence | Purification | Notes
SMART_dT18VN (SEQ ID NO: 13) | AAGCAGTGGTATCAACGCAGAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTVN | Standard desalt | RT primer for bulk RNA calling card recovery
SMART (SEQ ID NO: 14) | AAGCAGTGGTATCAACGCAGAGT | Standard desalt | PCR primer for bulk RNA calling card amplification
SRT_PAC_F1 (SEQ ID NO: 15) | CAACCTCCCCTTCTACGAGC | Standard desalt | Puromycin marker in SRT
SRT_tdTomato_F1 (SEQ ID NO: 16) | TCCTGTACGGCATGGACGAG | Standard desalt | tdTomato marker in SRT
Raff_ACTB_F (SEQ ID NO: 17) | CCTCGCCTTTGCCGATCCG | Standard desalt | Human ACTB primer (for RT control)
Raff_ACTB_R (SEQ ID NO: 18) | GGATCTTCATGAGGTAGTCAGTCAGGTCC | Standard desalt | Human ACTB primer (for RT control)
OM-PB-ACG (SEQ ID NO: 19) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTACGTTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-CTA (SEQ ID NO: 20) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTCTATTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-GAT (SEQ ID NO: 21) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGATTTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-TGC (SEQ ID NO: 22) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTGCTTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-TAG (SEQ ID NO: 23) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTAGTTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-ATC (SEQ ID NO: 24) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTATCTTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-CGT (SEQ ID NO: 25) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTCGTTTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-PB-GCA (SEQ ID NO: 26) | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGCATTTACGCAGACTATCTTTCTAG | Standard desalt | For use with piggyBac SRTs
OM-SB-ACG (SEQ ID NO: 27) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTACGTAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-CTA (SEQ ID NO: 28) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTCTATAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-GAT (SEQ ID NO: 29) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTGATTAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-TGC (SEQ ID NO: 30) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTTGCTAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-TAG (SEQ ID NO: 31) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTTAGTAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-ATC (SEQ ID NO: 32) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTATCTAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-CGT (SEQ ID NO: 33) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTCGTTAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
OM-SB-GCA (SEQ ID NO: 34) | AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCTGCATAAGTGTATGTAAACTTCCGACTTCAA | Standard desalt | For use with Sleeping Beauty SRTs
N7 indexed primer (SEQ ID NO: 35 [index] SEQ ID NO: 36) | CAAGCAGAAGACGGCATACGAGAT[index]GTCTCGTGGGCTCGG | Standard desalt | Uniquely identifies each bulk RNA calling card library in conjunction with barcoded transposon primer
10x_TSO (SEQ ID NO: 37) | AAGCAGTGGTATCAACGCAGAGTACATrGrGrG | Standard desalt | For continuing 10x scRNA-seq prep after splitting first RT product in half
Bio_Illumina_Seq1_scCC_10X_3xPT (SEQ ID NO: 38) | /5Phos/ACACTCTTTCCC/iBiodT/ACACGACGCTCTTCCGA*T*C*T | HPLC | Single cell calling card primer for use with 10x Chromium 3′ v2 kit
Bio_Long_PB_LTR_3xPT (SEQ ID NO: 39) | /5Phos/GCGTCAATTTTACGCAGAC/iBiodT/ATCTTTC*T*A*G | HPLC | Single cell calling card primer for use with piggyBac SRTs
scCC_P5_adapter (SEQ ID NO: 40) | AATGATACGGCGACCACCGAGATCTTCACTCATTCCACACGACTCCTTGCCAGTCTC*T | Standard desalt | Adapter for scCC (needs to be pre-annealed with scCC_P7_adapter)
scCC_P7_adapter (SEQ ID NO: 41 [index] SEQ ID NO: 42) | /5Phos/GAGACTGGCAAGTACACGTCGCACTCACCATGA[index]ATCTCGTATGCCGTCTTCTGCTTG | Standard desalt | Adapter for scCC (needs to be pre-annealed with scCC_P5_adapter)
scCC_P5_primer (SEQ ID NO: 43) | AATGATACGGCGACCACCGAGATC | Standard desalt | For final scCC library PCR
scCC_P7_primer (SEQ ID NO: 44) | CAAGCAGAAGACGGCATACGAGAT | Standard desalt | For final scCC library PCR
scCC_PB_CustomRead2 (SEQ ID NO: 45) | CGTGTAGGGAAAGAGTGTGCGTCAATTTTACGCAGACTATCTTTCTAG | PAGE | For custom sequencing of piggyBac scCC libraries; read 2 should begin with GGTTAA
scCC_CustomIndex1 (SEQ ID NO: 46) | GAGACTGGCAAGTACACGTCGCACTCACCATGA | PAGE | For custom sequencing of scCC libraries
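The OM-PB and OM-SB primers listed in TABLE 2 each appear to consist of a shared 5′ portion, a 3-base barcode, and a shared 3′ portion that anneals to the transposon terminal repeat. The sketch below assembles a primer from those three parts. The constant and function names are hypothetical, and the decomposition into prefix, barcode, and suffix is an inference from comparing the listed sequences rather than a stated design rule.

```python
# Shared 5' and 3' portions, read off the OM-PB and OM-SB rows of TABLE 2.
PB_PREFIX = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
PB_SUFFIX = "TTTACGCAGACTATCTTTCTAG"
SB_PREFIX = "AATGATACGGCGACCACCGAACACTCTTTCCCTACACGACGCTCTTCCGATCT"
SB_SUFFIX = "TAAGTGTATGTAAACTTCCGACTTCAA"

# The eight 3-base barcodes used in the primer names.
BARCODES = ["ACG", "CTA", "GAT", "TGC", "TAG", "ATC", "CGT", "GCA"]


def om_primer(transposon, barcode):
    """Return (name, sequence) for a barcoded transposon primer."""
    if barcode not in BARCODES:
        raise ValueError("unknown barcode: " + barcode)
    if transposon == "PB":
        prefix, suffix = PB_PREFIX, PB_SUFFIX
    elif transposon == "SB":
        prefix, suffix = SB_PREFIX, SB_SUFFIX
    else:
        raise ValueError("transposon must be 'PB' or 'SB'")
    return "OM-%s-%s" % (transposon, barcode), prefix + barcode + suffix


name, seq = om_primer("PB", "ACG")
print(name, seq)
```

Because the barcode sits immediately 5′ of the terminal repeat sequence, it is the first thing read in read 1, ahead of the transposon end and the insertion site.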

After the mixture was transferred to ice and allowed to rest for 1 minute, 4 μl 5× Maxima RT Buffer, 1 μl RNaseOUT (Thermo Fisher #10777019), and 1 μl of Maxima H Minus Reverse Transcriptase diluted 1:1 in 1× RT Buffer (100 U) were added. The solution was mixed by pipetting and incubated at 50° C. for 1 hour followed by heat inactivation at 85° C. for 10 minutes. Finally, the mixture was digested with 1 μl RNaseH (NEB #M0297S) at 37° C. for 30 minutes. cDNA was stored at −20° C.

Amplification of Self-Reporting Transcripts from Bulk RNA

The PCR conditions for amplifying self-reporting transcripts (i.e., transcripts derived from self-reporting transposons) involved mixing 1 μl cDNA template with 12.5 μl Kapa HiFi HotStart ReadyMix (Kapa Biosystems #KK2601), 0.5 μl 25 μM SMART primer, and either 1 μl of 25 μM SRT_PAC_F1 primer (in the case of puromycin selection) or 0.5 μl of 25 μM SRT_tdTomato_F1 primer (in the case of tdTomato screening). The mixture was brought up to 25 μl with ddH₂O. Thermocycling parameters were as follows: 95° C. for 3 minutes; 20 cycles of 98° C. for 20 seconds, 65° C. for 30 seconds, and 72° C. for 5 minutes; 72° C. for 10 minutes; hold at 4° C. As a control, cDNA quality can be assessed with exon-spanning primers for β-actin (see TABLE 2 for examples of human primers) under the same thermocycling settings.

PCR products were purified using AMPure XP beads (Beckman Coulter #A63880). 12 μl of resuspended beads were added to the 25 μl PCR product and mixed homogeneously by pipetting. After a 5-minute incubation at room temperature, the solution was placed on a magnetic rack for 2 minutes. The supernatant was aspirated and discarded. The pellet was washed twice with 200 μl of 70% ethanol (incubated for 30 seconds each time), discarding the supernatant each time. The pellet was left to dry at room temperature for 2 minutes. To elute, 20 μl ddH₂O was added to the pellet, the pellet was resuspended by pipetting, and the mixture was incubated at room temperature for 2 minutes and placed on a magnetic rack for one minute. Once clear, the solution was transferred to a clean 1.5 ml tube. DNA concentration was measured on the Qubit 3.0 Fluorometer (Thermo Fisher #Q33216) using the dsDNA High Sensitivity Assay Kit (Thermo Fisher #Q32851).

Generation of Bulk RNA Calling Card Libraries

Calling card libraries from bulk RNA were generated using the Nextera XT DNA Library Preparation Kit (Illumina #FC-131-1024). One nanogram of PCR product was resuspended in 5 μl ddH₂O. To this mixture 10 μl Tagment DNA (TD) Buffer and 5 μl Amplicon Tagment Mix (ATM) were added. After pipetting to mix, the solution was incubated in a thermocycler preheated to 55° C. The tagmentation reaction was halted by adding 5 μl Neutralization Tagment (NT) Buffer and was kept at room temperature for 5 minutes. The final PCR was set up by adding 15 μl Nextera PCR Mix (NPM), 8 μl ddH₂O, 1 μl of 10 μM transposon primer (e.g., OM-PB-NNN) and 1 μl Nextera N7 indexed primer. The transposon primer anneals to the end of the transposon terminal repeat (piggyBac, in the case of OM-PB primers, or Sleeping Beauty, in the case of OM-SB primers) and contains a 3 base pair barcode sequence. Every N7 primer contains a unique index sequence that is demultiplexed by the sequencer. Each replicate was assigned a unique combination of barcoded transposon primer and indexed N7 primer, enabling precise identification of each library's sequencing reads.
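One way to assign each replicate a unique combination of transposon primer barcode and N7 index is sketched below. The barcodes are those in the OM-PB and OM-SB primer names; the N7 index labels, replicate names, and function name are placeholders invented for the example.

```python
from itertools import product

# 3-base barcodes carried by the OM-PB / OM-SB transposon primers.
BARCODES = ["ACG", "CTA", "GAT", "TGC", "TAG", "ATC", "CGT", "GCA"]


def assign_combinations(replicates, n7_indices, barcodes=BARCODES):
    """Map each replicate to a distinct (N7 index, barcode) pair.

    Pairs are handed out in order: every barcode is used with the first
    index before the second index is started.
    """
    combos = product(n7_indices, barcodes)
    assignment = {}
    for replicate in replicates:
        try:
            assignment[replicate] = next(combos)
        except StopIteration:
            raise ValueError("not enough index/barcode combinations") from None
    return assignment


# Placeholder names: ten replicates and two hypothetical N7 indices.
replicates = ["rep%d" % i for i in range(1, 11)]
plan = assign_combinations(replicates, ["N7_idx_A", "N7_idx_B"])
for replicate, (index, barcode) in plan.items():
    print(replicate, index, barcode)
```

With eight barcodes, each additional N7 index adds eight distinguishable libraries to a pooled sequencing run.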

The final PCR was run under the following conditions: 95° C. for 30 seconds; 13 cycles of 95° C. for 10 seconds, 50° C. for 30 seconds, and 72° C. for 30 seconds; 72° C. for 5 minutes; hold at 4° C. After PCR, the final library was purified using 30 μl (0.6×) AMPure XP beads, as described above. The library was eluted in 11 μl ddH₂O and quantitated on an Agilent TapeStation 4200 System using the High Sensitivity D1000 ScreenTape (Agilent #5067-5584 and #5067-5585).

Sequencing and Analysis of Bulk RNA Calling Card Libraries

Multiple calling card libraries were pooled together for sequencing on the Illumina HiSeq 2500 platform. To increase the complexity of the library, PhiX was added at a final loading concentration of 50%. Reads were demultiplexed by the N7 index sequences added during the final PCR. Read 1 began with the 3 base pair barcode followed by the end of the transposon terminal repeat, culminating with the insertion site motif (TTAA in the case of piggyBac; TA in the case of Sleeping Beauty) before entering the genome. piggyBac reads were checked for exact matches to the barcode, transposon sequence, and insertion site at the beginning of reads before being hard trimmed using cutadapt with the following settings: -g "^NNNTTTACGCAGACTATCTTTCTAGGGTTAA" (SEQ ID NO: 47) --minimum-length 1 --discard-untrimmed -e 0 --no-indels, where NNN is replaced with the primer barcode. Sleeping Beauty libraries were trimmed with the following settings: -g "^NNNTAAGTGTATGTAAACTTCCGACTTCAACTGTA" (SEQ ID NO: 48) --minimum-length 1 --discard-untrimmed -e 0 --no-indels. Reads passing this filter were then trimmed of any trailing Nextera adapter sequence, again using cutadapt and the following settings: -a "CTGTCTCTTATACACATCTCCGAGCCCACGAGACTNNNNNNNNNNTCTCGTATGCCGTCTTCTGCTTG" (SEQ ID NO: 49) --minimum-length 1. The remaining reads were aligned to the human genome (build hg38) with Novoalign 3 (Novocraft Technologies) and the following settings: -n 40 -o SAM -o SoftClip. Aligned reads were validated by confirming that they mapped adjacent to the insertion site motif. Successful reads were then converted to calling card format (.ccf) and visualized on the WashU Epigenome Browser v46.
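For illustration only, the anchored exact-match trimming step described above can be mimicked in a few lines. This sketch is not cutadapt and is not the pipeline used herein; it merely shows the logic of requiring the barcode, transposon end, and insertion site motif at the very start of a read with no mismatches or indels, discarding reads that fail, and keeping only the genomic remainder. The function and constant names are hypothetical.

```python
# Sequence expected after the 3-base barcode, taken from the cutadapt
# patterns in the text (transposon end followed by the insertion site motif).
PB_AFTER_BARCODE = "TTTACGCAGACTATCTTTCTAGGGTTAA"
SB_AFTER_BARCODE = "TAAGTGTATGTAAACTTCCGACTTCAACTGTA"


def trim_read(read, barcode, after_barcode=PB_AFTER_BARCODE, minimum_length=1):
    """Return the genomic portion of a read, or None if the read is rejected.

    A read is kept only if it starts with barcode + after_barcode exactly
    and at least `minimum_length` bases remain after trimming.
    """
    prefix = barcode + after_barcode
    if not read.startswith(prefix):
        return None
    remainder = read[len(prefix):]
    if len(remainder) < minimum_length:
        return None
    return remainder


# Made-up reads for a library built with the ACG-barcoded piggyBac primer.
good = "ACG" + PB_AFTER_BARCODE + "GATTACAGATTACA"
wrong_barcode = "CTA" + PB_AFTER_BARCODE + "GATTACAGATTACA"
print(trim_read(good, "ACG"), trim_read(wrong_barcode, "ACG"))
```

Note that the insertion site motif is removed together with the transposon sequence, which is why aligned reads are afterwards checked for mapping adjacent to that motif in the genome.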

In Vitro Single Cell Calling Card Experiments

N2a and K562 cells were cultured and transfected identically to HCT-116 cells, with the following exceptions: K562 cells were grown in RPMI 1640 Medium (Gibco #11875-085); for K562 cells, Neon electroporation settings were: pulse voltage, 1,450 V; pulse width, 10 ms; pulse number, 3; for N2a cells, Neon electroporation settings were: pulse voltage, 1,050 V; pulse width, 30 ms; pulse number, 2. For N2a cells, one replicate (2×10⁶ cells) was transfected with 5 μg PB-SRT-Puro and 5 μg HyPBase, while another replicate was transfected with 5 μg PB-SRT-Puro only. For K562 cells, 4 replicates received both plasmids and one received the SRT alone. After 1 week of selection, N2a or K562 cells were mixed with transfected HCT-116 cells and then underwent single cell RNA-seq library preparation. For the species mixing experiment, cells were classified as either human or mouse if at least 80% of self-reporting transcripts in that cell mapped to the human or mouse genome, respectively.
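The 80% rule for the species mixing experiment can be written as a short function. The function name, the label "mixed" for cells that meet neither threshold, and the label "unassigned" for cells with no mapped transcripts are assumptions added for this sketch; the text above defines only the human and mouse calls.

```python
def classify_species(human_count, mouse_count, threshold=0.8):
    """Classify a cell from its counts of self-reporting transcripts.

    A cell is called human (or mouse) if at least `threshold` of its
    transcripts map to the human (or mouse) genome.
    """
    total = human_count + mouse_count
    if total == 0:
        return "unassigned"
    if human_count / total >= threshold:
        return "human"
    if mouse_count / total >= threshold:
        return "mouse"
    return "mixed"


# Made-up counts for four cells.
for counts in [(8, 2), (1, 9), (5, 5), (0, 0)]:
    print(counts, classify_species(*counts))
```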

Single Cell RNA-Seq Library Preparation

Single cell RNA-seq libraries were prepared using 10× Genomics' Chromium Single Cell 3′ Library and Gel Bead Kit (v2 chemistry; #120267). Each replicate was targeted for recovery of 6,000 cells. Library preparation followed a modified version of the manufacturer's protocol. The Single Cell Master Mix was prepared with the RT Primer replaced by an equivalent volume of Low TE Buffer. GEM generation and GEM-RT incubation proceeded as instructed. At the end of Post GEM-RT cleanup, 36.5 μl Elution Solution I was added and 36 μl of the eluted sample was transferred to a new tube (instead of 35.5 μl and 35 μl, respectively). The eluate was split into two 18 μl aliquots and kept at −20° C. until ready for further processing. One aliquot was kept for single cell calling cards library preparation, while the other was further processed into a single cell RNA-seq library.

The RT Primer sequence was then added to the products in the scRNA-seq aliquot. An RT master mix was created by adding 20 μl of Maxima 5× RT Buffer, 20 μl of 20% w/v Ficoll PM-400 (GE Healthcare #17030010), 10 μl of 10 mM dNTPs (Clontech #639125), 2.5 μl RNase Inhibitor (Lucigen), and 2.5 μl of 100 μM 10×_TSO. To this solution 18 μl of the first RT product and 22 μl of ddH₂O were added. Finally, 5 μl Maxima H Minus Reverse Transcriptase was added, and the reaction was mixed by flicking and centrifuged briefly. This reaction was incubated at 25° C. for 30 minutes followed by 50° C. for 90 minutes and heat inactivated at 85° C. for 5 minutes. The solution was purified using DynaBeads MyOne Silane (Thermo Fisher #37002D) following 10× Genomics' instructions, beginning at “Post GEM-RT Cleanup - Silane DynaBeads” step D. The remainder of the single cell RNA-seq protocol, including purification, amplification, fragmentation, and final library amplification, followed manufacturer's instructions.

Single Cell Calling Cards Library Preparation

To amplify self-reporting transcripts from single cell RNA-seq libraries, 9 μl of RT product (the other half was kept in reserve) was added to 25 μl Kapa HiFi HotStart ReadyMix and 15 μl ddH₂O. A PCR primer cocktail was then prepared comprising 5 μl of 100 μM Bio_llumina_Seq1_scCC_10×_3×PT primer, 5 μl of 100 μM Bio_Long_PB_LTR_3×PT, and 10 μl of 10 mM Tris-HCl, 0.1 mM EDTA buffer (IDT #11-05-01-13). One μl of this cocktail was added to the PCR mixture, which was then placed in a thermocycler (Eppendorf MasterCycler Pro). Thermocycling settings were as follows: 98° C. for 3 minutes; 20-22 cycles of 98° C. for 20 seconds, 67° C. for 30 seconds, and 72° C. for 5 minutes; 72° C. for 10 minutes; 4° C. forever. PCR purification was performed with 30 μl AMPure XP beads (0.6× ratio) as described previously. The resulting library was quantitated on an Agilent TapeStation 4200 System using the High Sensitivity D5000 ScreenTape (Agilent #5067-5592 and #5067-5593).

Single cell calling card library preparation was performed using the Nextera Mate Pair Sample Prep Kit (Illumina # FC-132-1001) with modifications to the manufacturer's protocol. The library was circularized by bringing 300 fmol (approximately 200 ng) of DNA up to a final volume of 268 μl with ddH₂O, then adding 30 μl Circularization Buffer 10× and 2 μl Circularization Ligase (final DNA concentration: 1 nM). This reaction was incubated overnight (12-16 hours) at 30° C. After removal of linear DNA (following manufacturer's instructions), the library was sheared on a Covaris E220 Focused-ultrasonicator with the following settings: peak power intensity, 200; duty factor, 20%; cycles per burst, 200; time, 40 seconds; temperature, 6° C.
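As a check on the stated 1 nM final concentration, the circularization reaction volumes can be summed and the molar concentration computed as in the following sketch. The function name is hypothetical; the only inputs are the amounts given above.

```python
def final_concentration_nM(amount_fmol, volumes_ul):
    """Concentration (nM) of `amount_fmol` of DNA in the summed reaction volume."""
    total_ul = sum(volumes_ul)
    # 1 fmol per microliter equals 1 nmol per liter, so the ratio is in nM.
    return amount_fmol / total_ul


if __name__ == "__main__":
    # 268 ul DNA in water + 30 ul 10x buffer + 2 ul ligase = 300 ul total
    print(final_concentration_nM(300, [268, 30, 2]))  # prints 1.0
```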

The library preparation proceeded per manufacturer's instructions until adapter ligation. Custom adapters were designed (see e.g., TABLE 2) so that the standard Illumina sequencing primers would not interfere with the library. Adapters were prepared by combining 4.5 μl of 100 μM scCC_P5_adapter, 4.5 μl of 100 μM scCC_P7_adapter, and 1 μl of NEBuffer 2 (NEB # B7002S), then heating in a thermocycler at 95° C. for 5 minutes, holding at 70° C. for 15 minutes, ramping down at 1% until the mixture reached 25° C., holding at that temperature for 5 minutes, and finally keeping at 4° C. forever. One μl of this custom adapter mix was used in place of the manufacturer's recommended DNA Adapter Index. The ligation product was cleaned per manufacturer's instructions. For the final PCR, the master mix was created by combining 20 μl Enhanced PCR Mix with 28 μl of ddH₂O and 1 μl each of 25 μM scCC_P5_primer and 25 μM scCC_P7_primer. This was then added to the streptavidin bead-bound DNA and amplified under the following conditions: 98° C. for 30 seconds; 15 cycles of 98° C. for 10 seconds, 60° C. for 30 seconds, and 72° C. for 2 minutes; 72° C. for 5 minutes; 4° C. forever. All of the PCR supernatant was transferred to a new tube and purified with 35 μl (0.7×) AMPure XP beads following manufacturer's instructions. The final library was eluted in 25 μl Elution Buffer (QIAGEN #19086) and quantitated on an Agilent TapeStation 4200 System using the High Sensitivity D1000 ScreenTape.

Sequencing and Analysis of scRNA-Seq Libraries

scRNA-seq libraries were sequenced on either Illumina HiSeq 2500 or NovaSeq S1 machines. Reads were analyzed using 10× Genomics' cellranger 2.1.0 with the following settings: --expect-cells=6000 --chemistry=SC3Pv2 --localcores=16 --localmem=30. The digital gene expression matrices from 10× were then further processed with scanpy 1.3.7 for identification of highly variable genes, dimensionality reduction, and Louvain clustering. The species-mixing analysis was performed using Drop-seq_tools 1.11.

Sequencing and Analysis of scCC Libraries

scCC libraries were sequenced on Illumina NextSeq 500 machines (v2 Reagent Cartridges) with 50% PhiX. The standard Illumina primers were used for read 1 and index 2 (BP10 and BP14, respectively), and custom primers were used for read 2 and index 1 (see e.g., TABLE 2). Read 1 sequenced the cell barcode and unique molecular index of each self-reporting transcript. Read 2 began with GGTTAA (end of the piggyBac terminal repeat and insertion site motif) before continuing into the genome. Reads containing this exact hexamer were trimmed using cutadapt with the following settings: -g “^GGTTAA” --minimum-length 1 --discard-untrimmed -e 0 --no-indels. Reads passing this filter were then trimmed of any trailing P7 adapter sequence, again using cutadapt and with the following settings: -a “AGAGACTGGCAAGTACACGTCGCACTCACCATGANNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG” (SEQ ID NO: 50) --minimum-length 1. Reads passing these filters were aligned using 10× Genomics' cellranger with the following settings: --expect-cells=6000 --nosecondary --chemistry=SC3Pv2 --localcores=16 --localmem=30. This workflow also managed barcode validation and collapsing of UMIs. Aligned reads were validated by verifying that they mapped adjacent to TTAA tetramers. Reads were then converted to calling card format (.ccf, see above). Finally, to minimize the presence of intermolecular artifacts, each insertion was required to have been tagged by at least two different UMIs. The set of validated cell barcodes from each scRNA-seq library was used to demultiplex library-specific barcoded insertions from the scCC data. This approach requires that no cell barcodes be shared between scCC (and scRNA-seq) libraries. As a result, insertions from non-unique cell barcodes were excluded, which represented a very small number of total cells lost (<1% per multiplexed library).
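The requirement that each insertion be supported by at least two different UMIs can be sketched as follows. The record layout (cell barcode, chromosome, position, UMI) and the choice to count UMIs per cell barcode and insertion site are assumptions made for illustration; the actual workflow relied on cellranger for barcode validation and UMI collapsing.

```python
from collections import defaultdict


def filter_by_umi_support(records, min_umis=2):
    """Keep (cell, chrom, pos) insertions seen with at least `min_umis` UMIs.

    `records` is an iterable of (cell_barcode, chrom, position, umi) tuples,
    a hypothetical layout chosen for this sketch.
    """
    umis_per_insertion = defaultdict(set)
    for cell, chrom, pos, umi in records:
        umis_per_insertion[(cell, chrom, pos)].add(umi)  # duplicates collapse
    return sorted(key for key, umis in umis_per_insertion.items()
                  if len(umis) >= min_umis)


if __name__ == "__main__":
    reads = [
        ("AAAC", "chr1", 1000, "UMI1"),
        ("AAAC", "chr1", 1000, "UMI2"),
        ("AAAC", "chr2", 5000, "UMI3"),
        ("AAAC", "chr2", 5000, "UMI3"),
    ]
    print(filter_by_umi_support(reads))
```

In this example only the chr1 insertion is retained, because the chr2 insertion is supported by a single UMI observed twice.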

Peak Calling

Peaks were called in calling card data using Bayesian blocks, a noise-tolerant algorithm for segmenting discrete, one-dimensional data, using the astroML 0.3 implementation. Bayesian blocks segments the genome into non-overlapping blocks where the density of calling card insertions is uniform. By comparing the segmentation against a background model, Poisson statistics were used to assess whether a given block shows statistically significant enrichment for insertions. Let B={b_1, b_2, . . . b_n} represent the set of blocks found by performing Bayesian block segmentation on all insertions from a TF-directed experiment (e.g., SP1-PBase). For each block b_i, let x_i be the number of insertions in that block in the TF-directed experiment. Similarly, let y_i′ be the number of insertions in that block in the undirected experiment (e.g., PBase) normalized to the total number of insertions found in the TF-directed experiment. Then, for each block, the Poisson p-value of observing at least x_i insertions, assuming a Poisson distribution with expectation y_i′, was calculated: P(k≥x_i|λ=y_i′). All blocks that were significant beyond a particular p-value threshold were accepted.
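The per-block Poisson test can be illustrated with the following Python sketch, which uses only the standard library. The function names are hypothetical, the upper tail is summed directly so that very small p-values are not lost to rounding, and the optional pseudocount is assumed here to be added to the expectation.

```python
import math


def poisson_tail(x, lam):
    """P(k >= x) for k ~ Poisson(lam), summed directly over the upper tail."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Probability mass at k = x, computed in log space for stability.
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    k = x
    # Continue past the mode (k <= lam) and until terms become negligible.
    while term > 0.0 and (k <= lam or term > total * 1e-17):
        total += term
        k += 1
        term *= lam / k
    return min(total, 1.0)


def block_p_value(x_i, y_i, n_directed, n_undirected, pseudocount=0.0):
    """p-value for a block with x_i directed and y_i undirected insertions."""
    # Scale the undirected count to the size of the directed experiment.
    expectation = y_i * n_directed / n_undirected + pseudocount
    return poisson_tail(x_i, expectation)


if __name__ == "__main__":
    print(block_p_value(x_i=30, y_i=4, n_directed=10000, n_undirected=20000))
```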

For bulk analysis of SP1-PBase and SP1-HyPBase insertions, a pseudocount of 0.1 was added to all blocks and p-value cutoffs of 10⁻⁶ and 10⁻²² were used, respectively. For single cell analysis of SP1-HyPBase insertions, a pseudocount of 1 was added to all blocks and a p-value cutoff of 10⁻⁹ was used. All three of these values were beyond a Bonferroni-corrected α of 0.05. Peak calls were polished by merging statistically significant blocks that were within 250 bases of each other and by aligning block edges to coincide with TTAAs.
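The merging of nearby significant blocks can be sketched as a single pass over sorted intervals. This is a minimal illustration for one chromosome; the function name is hypothetical, and "within 250 bases" is assumed to mean that the gap between the end of one block and the start of the next is at most 250.

```python
def merge_nearby_blocks(blocks, max_gap=250):
    """Merge (start, end) blocks whose gap is at most `max_gap` bases."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


if __name__ == "__main__":
    print(merge_nearby_blocks([(100, 200), (400, 500), (1000, 1100)]))
```

The same routine applies to wider merge distances by changing `max_gap`.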

To identify BRD4 binding sites from undirected piggyBac insertions, those insertions were segmented using Bayesian blocks. For each block b_i, x_i denotes the number of undirected insertions in that block. x_i′, the expected number of insertions in block b_i, was also calculated, assuming piggyBac insertions were distributed uniformly across the genome. This was done by dividing the total number of undirected insertions by the total number of mappable TTAAs in the genome, then multiplying this value by the number of mappable TTAAs in block b_i. Then, for each block, the Poisson p-value P(k≥x_i|λ=x_i′) was calculated. All blocks that were significant beyond a particular p-value threshold were accepted. Finally, statistically significant blocks that were within 12,500 bases of each other were merged.

For the bulk PBase and HyPBase analysis, p-value cutoffs of 10⁻³⁰ and 10⁻⁶² were used, respectively. For both in vitro and in vivo single cell HyPBase analyses, a p-value cutoff of 10⁻⁹ was used. To call differentially-bound loci between upper and lower cortical layer neurons, the same framework as described above for SP1 was used but with reciprocal enrichment analyses where the upper layer insertions were used as the “experiment” track and the lower layer insertions were used as the “control” track, and vice-versa. Here again a p-value cutoff of 10⁻⁹ was used.

SP1 Binding Analysis in HCT-116 Cells

The SP1 peak calls were compared to a publicly-available ChIP-seq dataset as well as an input control file (see e.g., TABLE 3).

TABLE 3
External ChIP-seq datasets referenced in this work

Target | Cell type | Source | Accession | Control File | DOI
SP1 | HCT-116 | ENCODE | ENCFF000PCT | ENCFF000PBO |
BRD4 | HCT-116 | Publication | SRR2481799 | SRR2481800 | 10.1172/JCI83265
H3K27ac | HCT-116 | ENCODE | ENCFF082JPN, ENCFF176BXC | ENCFF048ZOQ, ENCFF827YXC |
H3K4me1 | HCT-116 | ENCODE | ENCFF088BWP, ENCFF804MJI | ENCFF048ZOQ, ENCFF827YXC |
H3K9me3 | HCT-116 | ENCODE | ENCFF578MDZ, ENCFF033XOG | ENCFF048ZOQ, ENCFF827YXC |
H3K27me3 | HCT-116 | ENCODE | ENCFF281SBT, ENCFF124GII | ENCFF048ZOQ, ENCFF827YXC |
CTCF | HCT-116 | ENCODE | ENCFF000OZC | ENCFF000PBO |
H3K9me2 | HCT-116 | ENCODE | ENCFF760OZN, ENCFF565FDP | ENCFF048ZOQ, ENCFF827YXC |
H3K36me3 | HCT-116 | ENCODE | ENCFF850EAH, ENCFF312RKB | ENCFF048ZOQ, ENCFF827YXC |
H4K20me1 | HCT-116 | ENCODE | ENCFF070JDY, ENCFF334HHB | ENCFF048ZOQ, ENCFF827YXC |
H3K4me2 | HCT-116 | ENCODE | ENCFF936MMN, ENCFF937OOL | ENCFF048ZOQ, ENCFF827YXC |
H3K4me3 | HCT-116 | ENCODE | ENCFF183OZI, ENCFF659FPR | ENCFF048ZOQ, ENCFF827YXC |
H3K9ac | HCT-116 | ENCODE | ENCFF408RRT | ENCFF413RQG |
H3K79me2 | HCT-116 | ENCODE | ENCFF865KPW, ENCFF947YPU | ENCFF048ZOQ, ENCFF827YXC |
BRD4 | K562 | ENCODE | ENCFF335PHG | ENCFF000BWK |
H3K27ac | K562 | ENCODE | ENCFF000BXH | ENCFF000BWK |
H3K27ac | Mouse cortex | Publication | SRR6129714 | SRR6129695 | 10.1016/j.cell.2017.09.047

See below for more details on aligning and analyzing ChIP-seq data. A list of unique TSSs was collated by taking the 5′-most coordinates of RefSeq Curated genes in the hg38 build (UCSC Genome Browser). A list of CpG islands in HCT-116 cells and their methylation statuses were derived from previously-published Methyl-seq data. The liftOver tool (UCSC) was used to convert coordinates from hg18 to hg38. Enrichment in SP1-directed insertions was tested at TSSs, CpG islands, and unmethylated CpGs with the G test of independence. For motif discovery, MEME-ChIP 4.11.2 was used with a dinucleotide shuffled control and the following settings: -dna -nmeme 600 -seed 0 -ccut 250 -meme-mod zoops -meme-minw 4 -meme-nmotifs 5.

BRD4 Sensitivity, Specificity, and Precision Analysis in HCT-116 Cells

A published BRD4 ChIP-seq dataset was used to identify BRD4-bound super-enhancers in HCT-116 cells, following previously-described methods. Peaks were first called using MACS 1.4.1 at p<10⁻⁹, then this list was fed into ROSE 0.1. Artifactual loci less than 2,000 bp in size were discarded, yielding a final list of 162 super-enhancers. To evaluate sensitivity, bedtools 2.27.1 was used to ask what fraction of piggyBac peaks, at various p-value thresholds, overlapped the set of BRD4-bound super-enhancers. To measure specificity, a list of regions predicted to be insignificantly enriched (p>0.1) for BRD4 ChIP-seq signal was created. Bases from these regions were then sampled such that the distribution of peak sizes was identical to that of the 162 super-enhancers. Sampling to 642× coverage was then performed, sufficient to cover each base with one peak, on average. The fraction of the piggyBac peaks that overlapped these negative peaks was then determined and subtracted from 1 to obtain specificity. Finally, precision, or positive predictive value, was calculated by dividing the total number of detected super-enhancer peaks by the sum of the super-enhancer peaks and the false positive peaks.
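The overlap-based metrics described above can be sketched in Python as follows. This is an illustrative stand-in for the bedtools-based analysis, not a reproduction of it; the function names are hypothetical, intervals are assumed to be half-open (start, end) pairs on a single chromosome, and any overlap of at least one base counts as a hit.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(peaks, regions):
    """Fraction of `peaks` that overlap any interval in `regions`."""
    if not peaks:
        return 0.0
    hits = sum(any(overlaps(p, r) for r in regions) for p in peaks)
    return hits / len(peaks)


def specificity(peaks, negative_regions):
    """One minus the fraction of peaks overlapping the negative regions."""
    return 1.0 - fraction_overlapping(peaks, negative_regions)


def precision(n_super_enhancer_peaks, n_false_positive_peaks):
    """Detected super-enhancer peaks over all called positive peaks."""
    return n_super_enhancer_peaks / (n_super_enhancer_peaks + n_false_positive_peaks)


if __name__ == "__main__":
    peaks = [(0, 10), (20, 30), (40, 50), (60, 70)]
    print(fraction_overlapping(peaks, [(5, 25)]))   # prints 0.5
    print(specificity(peaks, [(45, 46)]))           # prints 0.75
    print(precision(3, 1))                          # prints 0.75
```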

Downsampling and Replication Analysis

When performing downsampling analyses on calling card insertions, insertions were randomly sampled without replacement and in proportion to the number of reads supporting each insertion. Peaks were called on the downsampled insertions at a range of p-value cutoffs. Linear interpolation was performed using numpy 1.15 and visualized using matplotlib 3.0. Replication was assessed by splitting calling card insertions into two approximately equal files based on their barcode sequences. Each new file was treated as a single biological experiment. For each peak called from the joint set of all insertions, the number of normalized insertions (insertions per million mapped insertions, or IPM) was plotted for one replicate on the x-axis and for the other replicate on the y-axis.
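One way to sample insertions without replacement in proportion to their read support is the weighted reservoir keying of Efraimidis and Spirakis, sketched below with the standard library. This technique is substituted here for illustration and is not stated to be the procedure actually used; the function names and the dictionary input are likewise hypothetical. The IPM normalization follows the definition given above.

```python
import heapq
import random


def downsample_insertions(read_counts, n, seed=0):
    """Pick up to `n` distinct insertions, weighted by supporting reads.

    `read_counts` maps an insertion identifier to its read count. Each
    insertion receives the key u ** (1 / weight) with u uniform on [0, 1);
    the n largest keys form a weighted sample without replacement.
    """
    rng = random.Random(seed)
    keyed = ((rng.random() ** (1.0 / reads), site)
             for site, reads in read_counts.items() if reads > 0)
    return [site for _, site in heapq.nlargest(n, keyed)]


def insertions_per_million(count, total_insertions):
    """Normalize a per-peak insertion count to IPM."""
    return count * 1e6 / total_insertions


if __name__ == "__main__":
    counts = {"chr1:1000": 5, "chr1:2000": 1, "chr2:500": 10}
    print(downsample_insertions(counts, 2, seed=1))
    print(insertions_per_million(5, 1000000))  # prints 5.0
```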

ChIP-Seq and Chromatin State Analyses

Raw reads were aligned using Novoalign with the following settings for single-end datasets: -o SAM -o SoftClip, while paired-end datasets were mapped with the additional flag -i PE 200-500. To calculate and visualize the fold enrichment in ChIP-seq signal at calling card peaks, deeptools 3.0.1 was used. The significance of the mean enrichment in BRD4 ChIP-seq signal at piggyBac peaks over randomly shuffled control peaks was tested with the Kolmogorov-Smirnov test. Chromatin state analysis was performed using ChromHMM 1.15 as previously described. For each chromatin state, the mean and standard deviation of the rate of normalized insertions per kilobase (IPM/kb) were plotted.

SRT-tdTomato Fluorescence Validation

To test the fluorescence properties of the SRT-tdTomato construct, K562 cells were transfected as previously described with one of the following: 1 μg of pUC19 plasmid; 0.5 μg of PB-SRT-tdTomato plasmid and 0.5 μg pUC19; 0.5 μg of PB-SRT-tdTomato and 0.5 μg PBase plasmid; or 0.5 μg of PB-SRT-tdTomato and 0.5 μg HyPBase plasmid. Cells were allowed to expand for 8 days, after which fluorescence activity was assayed on an Attune NxT Flow Cytometer (Thermo Fisher) with an excitation wavelength of 561 nm. Flow cytometry data were visualized using FlowCal 1.2.0. Bulk RNA calling cards were assayed on HEK293T cells transfected with SRT-tdTomato with or without HyPBase plasmid. While these cells were not sorted based on fluorescence activity, the SRT library from cells transfected with both SRT and transposase was more complex and contained many more insertions than the library from cells receiving SRT alone (see e.g., FIG. 9A).

In Vivo Single Cell Calling Cards Experiments

All mouse experiments were done following procedures previously described. In brief, the PB-SRT-tdTomato and HyPBase constructs were cloned into AAV vectors. The Hope Center Viral Vectors Core at Washington University in St. Louis packaged each construct in AAV9 capsids. Titers for each virus ranged between 1.1×10¹³ and 2.2×10¹³ viral genomes/ml. Equal volumes of each virus were mixed and intracranial cortical injections of the mixture into newborn wild-type C57BL/6J pups (P0-2) were performed. As a gating control, one litter-matched animal was injected with AAV9-PB-SRT-tdTomato only. After 2 to 4 weeks, the mice were sacrificed and the cortex (8 libraries) or hippocampus (1 library) was dissected.

Tissues were dissociated to single cell suspensions following a modification of previously published methods. Samples were incubated in a papain solution containing Hibernate-A (Gibco # A1247501) with 5% v/v trehalose (Sigma-Aldrich # T9531), 1× B-27 Supplement (Gibco #17504044), 0.7 mM EDTA (Corning #36-034-CI), 70 μM 2-mercaptoethanol (Gibco #21985023), and 2.8 mg/ml papain (Worthington Chemical Corporation # LS003118). After incubation at 37° C., cells were treated with DNase I (Worthington Chemical Corporation # NC9924263), triturated through increasingly narrow fire-polished pipettes, and passed through a 40-micron filter prewetted with resuspension solution: Hibernate-A containing 5% v/v trehalose, 0.5% Ovomucoid Trypsin Inhibitor (Worthington Chemical Corporation # NC9931428), 0.5% Bovine Serum Albumin (BSA; Sigma-Aldrich # A9418), 33 μg/ml DNase I, and 1× B-27 Supplement. The filter was washed with 6 ml of resuspension solution. The resulting suspension was centrifuged for 4 minutes at 250 g. The supernatant was discarded. The pellet was then resuspended in 2 ml of resuspension solution by gentle pipetting.

Subcellular debris was eliminated using gradient centrifugation. A working solution of 30% w/v OptiPrep Density Gradient Medium (Sigma-Aldrich # D1556) mixed with an equal volume of 1× Hank's Balanced Salt Solution (HBSS; Gibco #14185052) with 0.5% BSA was prepared. Solutions of densities 1.057, 1.043, 1.036, and 1.029 g/ml were then prepared by combining the working solution with resuspension solution at ratios of 0.33:0.67, 0.23:0.77, 0.18:0.82, and 0.13:0.87, respectively. 1 ml aliquots of each solution were layered in a 15 ml conical tube beginning with the densest solution on the bottom. The cell suspension was added last to the tube and centrifuged for 20 minutes at 800 g at 12° C. The top layer was then aspirated and purified cells were isolated from the remaining layers. These cells were then resuspended in FACS buffer: 1× HBSS, 2 mM MgCl₂ (Sigma-Aldrich # M4880), 2 mM MgSO₄ (Sigma-Aldrich # M2643), 1.25 mM CaCl₂ (Sigma-Aldrich # C7902), 1 mM D-glucose (Sigma-Aldrich # G7021), 0.02% BSA, and 5% v/v trehalose. Cells were centrifuged for 4 minutes at 250 g, the supernatant was discarded, and the pellet was resuspended in FACS buffer by gentle pipetting.

Cells were then sorted based on fluorescence activity. As a gating control, cells from cortices injected with AAV9-PB-SRT-tdTomato only were analyzed. Cells whose fluorescence values exceeded the gate were then collected from brains transfected with AAV9-PB-SRT-tdTomato and AAV9-HyPBase. After sorting, cells were centrifuged for 3 minutes at 250 g. The supernatant was discarded and cells were resuspended in FACS buffer at a concentration appropriate for 10× Chromium 3′ scRNA-seq library preparation.

Example 3: Use of the LINE1 (L1) Transposon for Calling Cards

This example describes the development and use of an inducible LINE1 Calling Cards method and the identification of L1-directed SP1 binding sites (see e.g., FIG. 18-FIG. 24). This example demonstrates that retrotransposons can be used to map the binding of genome-associated proteins. FIG. 18 is a schematic depicting how the calling cards method was adapted for use with the L1 retrotransposon with DNA recovery. The Sp1 transcription factor was fused to ORF2 of the L1 transposon in this construct (ORF2 is the L1 protein that is the equivalent of the transposase). ORF2 then cuts genomic DNA at TF binding sites and copies the RNA transposon via its reverse transcriptase activity to insert a DNA copy of the transposon in the genome. The construct shown in FIG. 18 is compatible with DNA recovery, so transposon locations are mapped by performing inverse PCR followed by Illumina sequencing. We used this protocol to recover undirected and Sp1-directed insertions of the L1 element. A genome-wide view of L1 insertions is shown in FIG. 19. We recovered a roughly equal number of insertions from the Sp1-retrotransposase fusion as with the unfused retrotransposase, demonstrating that the fusion does not significantly impair retrotransposase activity (see e.g., FIG. 20, top panel). Furthermore, the Sp1-retrotransposase fusion deposits significantly more transposons into promoters, 5′ UTRs, and CpG islands, consistent with Sp1's known binding preferences for these regions of the genome (see e.g., FIG. 20, bottom panel). Sp1-directed insertions also occur near Sp1 motifs more often than expected by chance (see e.g., FIG. 21). FIG. 22 provides an example of a known Sp1 binding site, demonstrating the concordance of ChIP-seq signal, L1 calling cards directed by Sp1, and piggyBac calling cards directed by Sp1. Together, these data demonstrate that retrotransposon insertions are significantly enriched near Sp1 binding sites.

Like the piggyBac transposon, the activity of the L1 can be regulated by a chemical inducer molecule, as demonstrated in FIG. 23, where transpositions are recovered only in the presence of an inducer molecule when a degradation domain is fused to the ORF1 protein of L1.

For this proof-of-principle experiment, we mapped L1 transposition events using a DNA-based recovery method. But L1 SRTs can also be recovered from the cellular RNA. To do so, one would perform the following steps, shown in FIG. 24:

(g) capturing the biotinylated PCR product on streptavidin-coated magnetic beads
(h) optionally, tailing the ends of the PCR product with a dideoxynucleotide (ddNTP)
(i) incubating the PCR products in vitro with Cas9 and guide RNAs (gRNAs) to specifically cut the unwanted transposon sequence
(j) end repairing, A-tailing, and ligating Y-adapters to the cut PCR products “on bead”
(k) amplifying by PCR with a primer specific for the Y-adapter and a primer specific to the universal sequence
(l) purifying the resulting PCR product and proceeding with step (iv).

Example 4: Tracing Cellular Lineages with Single Cell Calling Cards (scCCs)

This example describes how to infer phylogenetic relatedness of terminal cell fates from single cell calling cards data.

Calling cards are permanent and preserved through mitosis. Each insertion serves as a marker of clonal identity, and different patterns of insertions in terminal cell types can be used to reconstruct phylogenetic relationships within a complex tissue.

piggyBac transpositions along a known lineage tree are computationally simulated and then genealogical relationships between individual cells are inferred from insertion site information alone. This will demonstrate feasibility as well as identify the optimal transposition rate to maximize accuracy of the inferred phylogenetic tree. Using an inducible piggyBac transposase with titratable activity, an in vitro lineage tracing experiment can be performed. The ability of computational methods to reconstruct lineage relationships is demonstrated in this example, wherein those lineages are connected to transcriptionally distinct terminal cell fates.

Natural transposition events are already used to infer phylogenetic relationships between species. Herein is described a similar kind of analysis but at cellular, instead of geologic, time scales. The single cell calling cards protocol established in section (I) can be re-used for this study. Whereas (I) relies on TF-directed piggyBac transposases, (II) uses wild-type, undirected transposases that can integrate anywhere in the genome. Each insertion event can be thought of as a lineage-specific barcode; the content and distribution of lineage barcodes can be used to infer somatic phylogenies. By performing single cell calling cards on a heterogeneous population of cells, novel cell types can be identified and their genealogical context reconstructed at the same time.

Demonstrate Feasibility by Reconstructing Phylogenies from Computational Simulations of Single Cell Calling Cards

Herein is described the design of software that can simulate single cell calling card insertions across a monoclonal population. Every replicate randomly samples cells from a 20-generation bifurcating genealogy. This subset of cells has a known phylogeny. The program then simulates transposition of self-reporting transposons along the known tree. For these purposes, the program assumes there are 1000 self-reporting donors distributed in the genome, which is a reasonable upper limit for in vitro recording. The program then traverses the known tree and, at every cell division, a subset of donors transposes, in proportion to the mutation rate. After each round of mutation, the transposons are copied down to each daughter cell, and the process repeats until the end of the tree is reached. The state of the transposons in each descendant serves as a set of discrete markers of identity and is the input for a parsimony-based tree inference program. Lastly, the relationship between mutation rate and inference accuracy is plotted to find the optimal transposition rate per generation. Data shown in FIG. 25, which were derived from a simpler model, suggest this kind of analysis can recreate trees with high accuracy. Reconstruction accuracy is on the y-axis and ranges from 0 to 100%, while transposition rate is along the x-axis.
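A minimal Python sketch of such a simulation is given below for illustration. It is not the software referred to above. All names are hypothetical, and several modeling choices are assumptions: each daughter inherits the donor positions of its parent and then transposes independently, each donor moves with probability `rate` per division, and a transposed donor lands at a uniformly random site.

```python
import random


def simulate_lineage(generations, n_donors, rate, genome_sites=10**9, seed=0):
    """Return the leaves of a bifurcating tree as tuples of donor positions.

    Leaves are ordered so that consecutive pairs (0 and 1, 2 and 3, ...) are
    sister cells, while the first and last leaves split at the first division.
    """
    rng = random.Random(seed)
    cells = [tuple(rng.randrange(genome_sites) for _ in range(n_donors))]
    for _ in range(generations):
        next_cells = []
        for parent in cells:
            for _daughter in range(2):
                # Each donor either stays put or hops to a random new site.
                daughter = tuple(
                    rng.randrange(genome_sites) if rng.random() < rate else pos
                    for pos in parent)
                next_cells.append(daughter)
        cells = next_cells
    return cells


def shared_insertions(cell_a, cell_b):
    """Number of donors found at the same position in both cells."""
    return sum(a == b for a, b in zip(cell_a, cell_b))


if __name__ == "__main__":
    leaves = simulate_lineage(generations=4, n_donors=200, rate=0.1)
    print(len(leaves))
    print(shared_insertions(leaves[0], leaves[1]))    # sister cells
    print(shared_insertions(leaves[0], leaves[-1]))   # distant relatives
```

Under these assumptions, sister cells are expected to share more insertions than cells that diverged at the first division, which is the signal a tree inference program would exploit. A small tree is used here because a full 20-generation genealogy has more than one million leaves.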

Design Self-Reporting Calling Card Donors Suitable for Estimating Transposition Rates

With DNA-based calling cards, a fluorescent reporter construct split by a piggyBac donor is used; when transposition occurs, the original reading frame is restored and fluorescence can be detected. A similar effect can be achieved with RNA by placing a fluorescent protein inside the self-reporting transposon and a self-cleaving hammerhead ribozyme sequence downstream of the terminal repeat. When this construct is transfected without transposase, a unimodal fluorescence distribution is observed (see e.g., FIG. 26A, left) due to rapid degradation of the transcript. Addition of a transposase creates a bimodal fluorescence distribution (see e.g., FIG. 26A, right), which implies genomic integration of the donor in a proportion of the cells. By measuring the percent of cells in the rightmost peak over time (see e.g., FIG. 26B) and using Poisson statistics, the rate of transposition per cell division can be estimated. HEK293T cells are transfected with the self-reporting fluorescent transposon with ribozyme as well as an inducible piggyBac transposase. This cell line is cultured with different concentrations of inducers and the transposition rate is estimated based on the fraction of cells that show increased fluorescence.
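One simple way such an estimate could be made is sketched below. The model is an assumption introduced for illustration and is not specified above: transposition events per cell are taken to be Poisson distributed with a constant rate per division, and a cell is assumed to fall in the bright peak once it has undergone at least one event. Under that model the bright fraction f after t divisions satisfies f = 1 − exp(−rate × t). The function name is hypothetical.

```python
import math


def transposition_rate_per_division(fraction_bright, divisions):
    """Estimate events per cell per division from the bright-cell fraction.

    Assumes Poisson-distributed events at a constant rate, so that
    fraction_bright = 1 - exp(-rate * divisions).
    """
    if not 0.0 <= fraction_bright < 1.0:
        raise ValueError("fraction_bright must be in [0, 1)")
    if divisions <= 0:
        raise ValueError("divisions must be positive")
    return -math.log(1.0 - fraction_bright) / divisions


if __name__ == "__main__":
    # Example: 40% of cells in the rightmost peak after 8 divisions.
    print(transposition_rate_per_division(0.40, 8))
```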

Given the early computational results, it is expected that a more refined simulation will not only validate feasibility but also show better performance across a range of transposition rates. The in vitro lineage tracing experiment should result in a tree that matches the order in which the populations were split. The tree inferred from the in vitro neural differentiation protocol should show the oligodendrocyte lineage clustering closer to motor neurons than to astrocytes, which would corroborate known lineage relationships during motor neuron development.

Potential Pitfalls, Alternative Approaches, and Future Directions

The piggyBac transposase uses a cut-and-paste mechanism. If the transposition rate is too high, lineage-specific insertions acquired early in differentiation may re-transpose later and shuffle around in the genome. This increases noise and could hamper phylogenetic reconstruction. As an alternative, the newly reconstructed replicative transposase Helraiser effectively functions via a copy-paste mechanism. Thus, two cells that had distantly split from a progenitor should share few insertions, while very closely related cells should share almost all insertions. Parsimony-based methods are notoriously slow and may be impossible to use with data from thousands of single cells. Distance-based inferences, like UPGMA, or probabilistic inference algorithms, can run in less time. Lastly, Drop-seq is limited to collecting ~10% of a cell's transcripts, which may cause drop-out of SRTs and loss of lineage-specific information. Methods like FACS-seq, which sort live cells into individual wells of a 96-well plate, could be used instead. Barcoded RNA-seq library preparation is then performed inside each well. This technique offers greater transcript capture rates than Drop-seq and provides a more complete representation of a cell's transcriptome. Successful completion of this study should empower investigators with the tools necessary for reconstructing the developmental history of complex tissues in situ.
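As an illustration of the input a distance-based method such as UPGMA would take, the following sketch computes pairwise distances between cells from their sets of insertion sites. The Jaccard distance is chosen here as one plausible measure and is an assumption; the function names and the representation of a cell as a set of site identifiers are also hypothetical.

```python
def jaccard_distance(sites_a, sites_b):
    """1 minus the fraction of insertion sites shared by two cells."""
    a, b = set(sites_a), set(sites_b)
    union = a | b
    if not union:
        return 0.0  # two cells without insertions are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(cells):
    """Symmetric matrix of pairwise Jaccard distances between cells."""
    n = len(cells)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(cells[i], cells[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    cells = [{"chr1:100", "chr2:200", "chr3:300"},
             {"chr2:200", "chr3:300", "chr4:400"},
             {"chr5:500"}]
    for row in distance_matrix(cells):
        print(row)
```

Closely related cells, which share most insertions, receive distances near 0, while unrelated cells receive distances near 1.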

1. A self-reporting transposon (SRT) construct comprised of a transposon containing at least one promoter element wherein the promoter is capable of driving transcription of RNA through at least one transposon end after the SRT construct is inserted into genomic DNA, so that a portion of the transposon DNA, at least one transposon end, and the genomic DNA flanking the transposon end is transcribed into RNA.
2. A method for the insertion of an SRT construct into a cellular genome wherein the SRT construct and either a (i) transposase capable of cutting or copying the transposon out of the transposon construct and pasting into genomic DNA, or (ii) a genome-associated protein (e.g. a transcription factor, a chromatin reader, writer, or eraser), hereafter referred to as GAP, operably linked to a transposase capable of cutting the transposon out of the transposon construct and pasting into genomic DNA are delivered to cells so that the transposase gene or genome-associated protein-transposase fusion is expressed, or can be induced, after delivery to the cells. The SRT construct and transposase gene can be delivered simultaneously or sequentially.
3. A plasmid encoding the SRT construct of claim 1.
4. The SRT construct of claim 1, wherein the promoter is capable of transcribing the 3′ region flanking a transposon end comprising a transposon terminal repeat, resulting in an RNA transcript, wherein the RNA transcript is terminated by a cryptic poly-adenylation (poly-A) signal or picks up a poly-A stretch in the genome such that the transcript can be recovered by reverse transcription using a poly-T primer.
5. The SRT construct of claim 1, wherein the promoter is an inducible promoter.
6. The SRT construct of claim 5, wherein the inducible promoter is capable of being induced by a chemical inducer, light, or excision of a stop codon or poly-adenylation signal.
7. The SRT construct of claim 1, wherein the promoter is selected from the group consisting of, but not restricted to, an EF1α promoter, CAG promoter, PGK promoter, Tet-on or Tet-off promoter, a T7 promoter, or a CMV promoter.
8. The SRT construct of claim 1, wherein the promoter drives expression of a reporter gene incorporated in the transposon.
9. The SRT construct of claim 8, wherein the reporter gene is selected from the group consisting of: a gene encoding a fluorescent protein; a gene capable of use as a selectable marker by conferring resistance to a chemical agent that kills eukaryotic or prokaryotic cells; and an enzyme capable of converting a chemical substrate into a colorimetric, luminescent, or fluorescent reporter; and combinations thereof.
10. The SRT construct of claim 9, wherein the gene encoding a fluorescent protein is selected from, but not restricted to, the group consisting of green fluorescent protein, tdTomato, eGFP, and eCFP.
11. The SRT construct of claim 8, wherein the gene capable of use as a selectable marker by conferring resistance to a chemical agent that kills eukaryotic or prokaryotic cells is selected from the group consisting of, but not restricted to, puromycin N-acetyl-transferase, providing resistance to puromycin; either of two aminoglycoside 3′ phosphotransferase genes encoded by Tn5 and Tn601 (i.e., a neo gene), providing resistance to G418; or hygromycin phosphotransferase, providing resistance to hygromycin.
12. The SRT construct of claim 9, wherein the enzyme capable of converting a chemical substrate into a colorimetric, luminescent, or fluorescent reporter selected from the group consisting of, but not restricted to, beta-galactosidase or beta-lactamase, cleaving x-gal, and GeneBLAzer, respectively.
13. The SRT construct of claim 1, wherein the RNA transcript produced by the promoter does not encode a splice donor site between the promoter and the transposon end.
14. The SRT construct of claim 1, wherein the transposon does not contain a poly-adenylation (poly-A) termination signal between the promoter and the transposon end.
15. The SRT construct of claim 1, wherein the DNA containing the transposon encodes a self-cleaving ribozyme such as, but not restricted to, the hammerhead ribozyme immediately downstream of the transposon end.
16. The SRT construct of claim 1, wherein the construct encodes a bacterial or eukaryotic ribosomal RNA sequence (e.g. 5S, 18S) downstream of the transposon end.
17. The SRT construct of claim 1, wherein the transposon encodes a Woodchuck Hepatitis Virus (WHP) Posttranscriptional Regulatory Element (WPRE).
18. The method of claim 2 (ii), wherein the genome-associated protein is a transcription factor, a general transcriptional mediator, or a chromatin reader, writer, or eraser.
19. The method of claim 2 (ii), wherein the genome-associated protein and the transposase are separated by a peptide linker.
20. The method of claim 2 (ii), wherein the DNA-binding protein is selected from the group consisting of, but not restricted to, Brd4, Sp1, Hb9, Olig2, Ngn2, Med1, Creb, p53, Usf1, or FoxA2.
21-48. (canceled)